[R] Doing operations by grouping variable

Thu Sep 23 01:15:53 CEST 2010

Thanks for that terrific post Bill - I learnt a lot.

One question, should this line

d <- within(d, interaction(g1, g2, drop=TRUE))

be this ?

d <- within(d, g1g2 <- interaction(g1, g2, drop=TRUE))

Michael

On 23 September 2010 03:50, William Dunlap <wdunlap at tibco.com> wrote:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Seth W Bigelow
>> Sent: Tuesday, September 21, 2010 4:22 PM
>> To: Bill.Venables at csiro.au
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Doing operations by grouping variable
>>
>> Aah, that is the sort of truly elegant solution I have been
>> seeking. And
>> it's wrapped up in a nice programming shortcut to boot (i.e.,
>> the within
>> statement). I retract anything I may have said about tapply
>> being clunky.
>>
>> Many thanks
>>
>> --Seth
>>
>> Dr. Seth  W. Bigelow
>> Biologist, USDA-FS Pacific Southwest Research Station
>> 1731 Research Park Drive, Davis California
>>
>> <Bill.Venables at csiro.au>
>> 09/21/2010 03:15 PM
>>
>> To <sbigelow at fs.fed.us>
>>
>> You left out the subscript.  Why not just do
>>
>> d <- within(data.frame(group = rep(1:5, each = 5),
>>                        variable = rnorm(25)),
>>             scaled <- variable/tapply(variable, group, max)[group])
>
> This approach can be tricky when there is more than one
> grouping variable.  E.g., suppose we have grouping variables
> g1 and g2:
>  > d <- data.frame(x=1:10,
>                  g1=LETTERS[rep(11:12,each=5)],
>                  g2=letters[rep(21:23,c(3,3,4))])
>  > d
>      x g1 g2
>  1   1  K  u
>  2   2  K  u
>  3   3  K  u
>  4   4  K  v
>  5   5  K  v
>  6   6  L  v
>  7   7  L  w
>  8   8  L  w
>  9   9  L  w
>  10 10  L  w
> and we want to divide each x value by it max for each
> g1*g2 group (6 possible groups, of which 4 are in the
> data).
>
> You can extend Bill V.'s approach with
>  > with(d, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
>   [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
>   [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
> That would fail if g1 and g2 were not factors but were
> integer vectors.  Try it with
>  > di <- data.frame(x=1:10,
>                  g1=rep(11:12,each=5),
>                  g2=rep(21:23,c(3,3,4)))
>  > with(di, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
>  Error in tapply(x, list(g1, g2), FUN = max)[cbind(g1, g2)] :
>    subscript out of bounds
>
> To avoid that problem you can call tapply with no FUN
> to get the indices to subscript by
>  > with(d, x/tapply(x, list(g1,g2), FUN=max)[tapply(x, list(g1, g2))])
>   [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
>   [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
>
> The misleadingly named ave() can avoid the need to do the
> subscripting after tapply but has other problems
>  > with(d, x/ave(x, g1, g2, FUN=max))
>   [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
>   [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
>  Warning messages:
>  1: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning
> -Inf
>  2: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning
> -Inf
> It gives the right answer but it is calling FUN even for
> the empty interaction groups.  For some FUN's this would
> abort the call, not just give a warning.   In any case it
> is a waste of time.
>
> In either case you can also use the interaction() function to
> change the multiple grouping vectors into one:
>  > d <- within(d, interaction(g1, g2, drop=TRUE))
>  > with(d, x/ave(x, g1g2, FUN=max))
>   [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
>   [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
>  > with(d, x/tapply(x, g1g2, FUN=max)[g1g2])
>        K.u       K.u       K.u       K.v       K.v       L.v
>  0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000
>        L.w       L.w       L.w       L.w
>  0.7000000 0.8000000 0.9000000 1.0000000
>  > with(d, x/tapply(x, g1g2, FUN=max)[tapply(x, g1g2)])
>        K.u       K.u       K.u       K.v       K.v       L.v
>  0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000
>        L.w       L.w       L.w       L.w
>  0.7000000 0.8000000 0.9000000 1.0000000
> The names are probably unwanted in the tapply cases; use unname
> to get rid of them.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>