[R] Doing operations by grouping variable
William Dunlap
wdunlap at tibco.com
Wed Sep 22 19:50:38 CEST 2010
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Seth W Bigelow
> Sent: Tuesday, September 21, 2010 4:22 PM
> To: Bill.Venables at csiro.au
> Cc: r-help at r-project.org
> Subject: Re: [R] Doing operations by grouping variable
>
> Aah, that is the sort of truly elegant solution I have been seeking.
> And it's wrapped up in a nice programming shortcut to boot (i.e., the
> within statement). I retract anything I may have said about tapply
> being clunky.
>
> Many thanks
>
> --Seth
>
> Dr. Seth W. Bigelow
> Biologist, USDA-FS Pacific Southwest Research Station
> 1731 Research Park Drive, Davis California
>
> <Bill.Venables at csiro.au>
> 09/21/2010 03:15 PM
>
> To <sbigelow at fs.fed.us>
>
> You left out the subscript. Why not just do
>
> d <- within(data.frame(group = rep(1:5, each = 5),
>                        variable = rnorm(25)),
>             scaled <- variable/tapply(variable, group, max)[group])
This approach can be tricky when there is more than one
grouping variable. E.g., suppose we have grouping variables
g1 and g2:
> d <- data.frame(x=1:10,
                  g1=LETTERS[rep(11:12,each=5)],
                  g2=letters[rep(21:23,c(3,3,4))])
> d
    x g1 g2
1   1  K  u
2   2  K  u
3   3  K  u
4   4  K  v
5   5  K  v
6   6  L  v
7   7  L  w
8   8  L  w
9   9  L  w
10 10  L  w
and we want to divide each x value by the max of x within its
g1*g2 group (6 possible groups, of which 4 are in the
data).
You can extend Bill V.'s approach with
> with(d, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
[1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
[6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
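The subscripting step above relies on two things: tapply with two grouping variables returns a matrix of per-cell maxima, and a two-column matrix used as a subscript picks out one cell per row. Here is a minimal sketch of those pieces taken apart. It assumes g1 and g2 are made factors explicitly (since R 4.0.0, data.frame() no longer converts strings to factors by default); the names m and idx are only for illustration.

```r
d <- data.frame(x  = 1:10,
                g1 = factor(LETTERS[rep(11:12, each = 5)]),
                g2 = factor(letters[rep(21:23, c(3, 3, 4))]))

# 2 x 3 matrix of maxima, one cell per g1*g2 combination;
# the two combinations with no data are NA
m <- with(d, tapply(x, list(g1, g2), FUN = max))

# cbind() on factors gives their integer codes, i.e. one
# (row, column) pair per observation
idx <- with(d, cbind(g1, g2))

# matrix subscripting: m[idx] has one max per row of d
scaled <- d$x / m[idx]
```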
That would fail if g1 and g2 were integer vectors rather than
factors, because cbind(g1,g2) would then hold the raw values
(11, 21, ...) instead of level codes. Try it with
> di <- data.frame(x=1:10,
                   g1=rep(11:12,each=5),
                   g2=rep(21:23,c(3,3,4)))
> with(di, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
Error in tapply(x, list(g1, g2), FUN = max)[cbind(g1, g2)] :
subscript out of bounds
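To see why the subscript is out of bounds, compare the values in cbind(g1,g2) with the extent of the tapply result. The sketch below also shows one possible alternative that is not from this thread: looking the cells up by dimnames with a character matrix. The names m and idx are only for illustration.

```r
di <- data.frame(x  = 1:10,
                 g1 = rep(11:12, each = 5),
                 g2 = rep(21:23, c(3, 3, 4)))

m <- with(di, tapply(x, list(g1, g2), FUN = max))
dim(m)                      # 2 x 3, but cbind(g1, g2) asks for row 11, column 21
range(with(di, cbind(g1, g2)))

# dimnames(m) are "11","12" and "21","22","23", so a character
# matrix subscript finds the cells by name rather than by position
idx <- with(di, cbind(as.character(g1), as.character(g2)))
scaled <- di$x / m[idx]
```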
To avoid that problem you can call tapply with no FUN
to get the indices to subscript by
> with(di, x/tapply(x, list(g1,g2), FUN=max)[tapply(x, list(g1, g2))])
[1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
[6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
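A minimal sketch of what the FUN-less call returns, using the integer-grouped data so that it exercises the case that failed before. The names grp and maxes are only for illustration.

```r
di <- data.frame(x  = 1:10,
                 g1 = rep(11:12, each = 5),
                 g2 = rep(21:23, c(3, 3, 4)))

# With FUN omitted, tapply returns for each observation the position
# of its group in the (column-major) array tapply would normally build
grp <- with(di, tapply(x, list(g1, g2)))

# The usual call: a 2 x 3 matrix of group maxima
maxes <- with(di, tapply(x, list(g1, g2), FUN = max))

# grp is a valid vector subscript into maxes whatever the type of g1, g2
scaled <- di$x / maxes[grp]
```

Because grp is built from the same levels as maxes, the two always line up, which is what makes this form safe for integer, character, or factor grouping vectors.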
The misleadingly named ave() avoids the need to subscript
after tapply, but it has problems of its own:
> with(d, x/ave(x, g1, g2, FUN=max))
[1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
[6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
Warning messages:
1: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning -Inf
2: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning -Inf
It gives the right answer but it is calling FUN even for
the empty interaction groups. For some FUN's this would
abort the call, not just give a warning. In any case it
is a waste of time.
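If you do want to keep the multi-argument ave() call, one possible workaround is to hand it a FUN that tolerates empty input. This is only a sketch: max_or_na is a hypothetical helper, not something from base R, and it assumes ave() calls FUN on the empty level combinations as described above.

```r
d <- data.frame(x  = 1:10,
                g1 = factor(LETTERS[rep(11:12, each = 5)]),
                g2 = factor(letters[rep(21:23, c(3, 3, 4))]))

# hypothetical helper: return NA instead of calling max() on
# a zero-length vector
max_or_na <- function(v) if (length(v) > 0) max(v) else NA_integer_

scaled <- with(d, x / ave(x, g1, g2, FUN = max_or_na))
```

This silences the warnings but FUN is still called once per empty combination, so it does not address the wasted work.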
In either case you can also use the interaction() function to
change the multiple grouping vectors into one:
> d <- within(d, g1g2 <- interaction(g1, g2, drop=TRUE))
> with(d, x/ave(x, g1g2, FUN=max))
[1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
[6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
> with(d, x/tapply(x, g1g2, FUN=max)[g1g2])
K.u K.u K.u K.v K.v L.v
0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000
L.w L.w L.w L.w
0.7000000 0.8000000 0.9000000 1.0000000
> with(d, x/tapply(x, g1g2, FUN=max)[tapply(x, g1g2)])
K.u K.u K.u K.v K.v L.v
0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000
L.w L.w L.w L.w
0.7000000 0.8000000 0.9000000 1.0000000
The names are probably unwanted in the tapply cases; use unname
to get rid of them.
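Putting the interaction() and unname() suggestions together, a minimal sketch might look like this. The column name g1g2 follows the transcript above, and the factors are created explicitly on the assumption that data.frame() does not do so by default.

```r
d <- data.frame(x  = 1:10,
                g1 = factor(LETTERS[rep(11:12, each = 5)]),
                g2 = factor(letters[rep(21:23, c(3, 3, 4))]))

# One combined grouping factor; drop=TRUE discards the level
# combinations that never occur in the data
d <- within(d, g1g2 <- interaction(g1, g2, drop = TRUE))

# tapply version, with the group names stripped from the result
scaled <- with(d, unname(x / tapply(x, g1g2, FUN = max)[g1g2]))

# ave version; only non-empty groups exist, so FUN is never
# called on empty input
scaled_ave <- with(d, x / ave(x, g1g2, FUN = max))
```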
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> and be done with it?
>
> (Warning: if you replace the second '<-' above by '=', it will not
> work. It is NOT true that you can always replace '<-' by '=' for
> assignment. Why?)
>
> Bill Venables.
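On the '=' question: at the top level of a function call, scaled = ... is parsed as an argument named scaled, not as an assignment, so within() never sees it as the expression to evaluate. A sketch of one way around that, if you prefer '=', is to wrap the expression in braces, where '=' is an ordinary assignment. The set.seed() call is only there to make the example reproducible.

```r
set.seed(1)  # not in the original; for reproducibility only

d <- within(data.frame(group = rep(1:5, each = 5),
                       variable = rnorm(25)), {
  # inside { } this '=' assigns; it is no longer an argument name
  scaled = variable / tapply(variable, group, max)[group]
})
```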