[R] Doing operations by grouping variable
Michael Bedward
michael.bedward at gmail.com
Thu Sep 23 01:15:53 CEST 2010
Thanks for that terrific post Bill - I learnt a lot.
One question, should this line
d <- within(d, interaction(g1, g2, drop=TRUE))
be this ?
d <- within(d, g1g2 <- interaction(g1, g2, drop=TRUE))
Michael
On 23 September 2010 03:50, William Dunlap <wdunlap at tibco.com> wrote:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Seth W Bigelow
>> Sent: Tuesday, September 21, 2010 4:22 PM
>> To: Bill.Venables at csiro.au
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Doing operations by grouping variable
>>
>> Aah, that is the sort of truly elegant solution I have been
>> seeking. And
>> it's wrapped up in a nice programming shortcut to boot (i.e.,
>> the within
>> statement). I retract anything I may have said about tapply
>> being clunky.
>>
>> Many thanks
>>
>> --Seth
>>
>> Dr. Seth W. Bigelow
>> Biologist, USDA-FS Pacific Southwest Research Station
>> 1731 Research Park Drive, Davis California
>>
>> <Bill.Venables at csiro.au>
>> 09/21/2010 03:15 PM
>>
>> To <sbigelow at fs.fed.us>
>>
>> You left out the subscript. Why not just do
>>
>> d <- within(data.frame(group = rep(1:5, each = 5),
>> variable = rnorm(25)),
>> scaled <- variable/tapply(variable, group, max)[group])
>
> This approach can be tricky when there is more than one
> grouping variable. E.g., suppose we have grouping variables
> g1 and g2:
> > d <- data.frame(x=1:10,
> g1=LETTERS[rep(11:12,each=5)],
> g2=letters[rep(21:23,c(3,3,4))])
> > d
> x g1 g2
> 1 1 K u
> 2 2 K u
> 3 3 K u
> 4 4 K v
> 5 5 K v
> 6 6 L v
> 7 7 L w
> 8 8 L w
> 9 9 L w
> 10 10 L w
> and we want to divide each x value by it max for each
> g1*g2 group (6 possible groups, of which 4 are in the
> data).
>
> You can extend Bill V.'s approach with
> > with(d, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
> [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
> [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
> That would fail if g1 and g2 were not factors but were
> integer vectors. Try it with
> > di <- data.frame(x=1:10,
> g1=rep(11:12,each=5),
> g2=rep(21:23,c(3,3,4)))
> > with(di, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
> Error in tapply(x, list(g1, g2), FUN = max)[cbind(g1, g2)] :
> subscript out of bounds
>
> To avoid that problem you can call tapply with no FUN
> to get the indices to subscript by
> > with(d, x/tapply(x, list(g1,g2), FUN=max)[tapply(x, list(g1, g2))])
> [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
> [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
>
> The misleadingly named ave() can avoid the need to do the
> subscripting after tapply but has other problems
> > with(d, x/ave(x, g1, g2, FUN=max))
> [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
> [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
> Warning messages:
> 1: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning
> -Inf
> 2: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning
> -Inf
> It gives the right answer but it is calling FUN even for
> the empty interaction groups. For some FUN's this would
> abort the call, not just give a warning. In any case it
> is a waste of time.
>
> In either case you can also use the interaction() function to
> change the multiple grouping vectors into one:
> > d <- within(d, interaction(g1, g2, drop=TRUE))
> > with(d, x/ave(x, g1g2, FUN=max))
> [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
> [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
> > with(d, x/tapply(x, g1g2, FUN=max)[g1g2])
> K.u K.u K.u K.v K.v L.v
> 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000
> L.w L.w L.w L.w
> 0.7000000 0.8000000 0.9000000 1.0000000
> > with(d, x/tapply(x, g1g2, FUN=max)[tapply(x, g1g2)])
> K.u K.u K.u K.v K.v L.v
> 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000
> L.w L.w L.w L.w
> 0.7000000 0.8000000 0.9000000 1.0000000
> The names are probably unwanted in the tapply cases; use unname
> to get rid of them.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
More information about the R-help
mailing list