[R] Doing operations by grouping variable

Wed Sep 22 19:50:38 CEST 2010

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Seth W Bigelow
> Sent: Tuesday, September 21, 2010 4:22 PM
> To: Bill.Venables at csiro.au
> Cc: r-help at r-project.org
> Subject: Re: [R] Doing operations by grouping variable
> 
> Aah, that is the sort of truly elegant solution I have been seeking. And
> it's wrapped up in a nice programming shortcut to boot (i.e., the within
> statement). I retract anything I may have said about tapply being clunky.
> 
> Many thanks
> 
> --Seth
> 
> Dr. Seth  W. Bigelow
> Biologist, USDA-FS Pacific Southwest Research Station
> 1731 Research Park Drive, Davis California
> 
> <Bill.Venables at csiro.au> 
> 09/21/2010 03:15 PM
> 
> To <sbigelow at fs.fed.us>
> 
> You left out the subscript.  Why not just do
> 
> d <- within(data.frame(group = rep(1:5, each = 5), 
>                        variable = rnorm(25)), 
>             scaled <- variable/tapply(variable, group, max)[group])

This approach can be tricky when there is more than one
grouping variable.  E.g., suppose we have grouping variables
g1 and g2:
  > d <- data.frame(x=1:10,
                  g1=LETTERS[rep(11:12,each=5)],
                  g2=letters[rep(21:23,c(3,3,4))]) 
  > d
      x g1 g2
  1   1  K  u
  2   2  K  u
  3   3  K  u
  4   4  K  v
  5   5  K  v
  6   6  L  v
  7   7  L  w
  8   8  L  w
  9   9  L  w
  10 10  L  w
and we want to divide each x value by its max for each
g1*g2 group (6 possible groups, of which 4 are in the
data).

You can extend Bill V.'s approach with
  > with(d, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
   [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
   [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
That would fail if g1 and g2 were not factors but were
integer vectors, because cbind(g1,g2) would then hold the
raw values (11, 21, ...) rather than level codes.  Try it with
  > di <- data.frame(x=1:10,
                  g1=rep(11:12,each=5),
                  g2=rep(21:23,c(3,3,4))) 
  > with(di, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
  Error in tapply(x, list(g1, g2), FUN = max)[cbind(g1, g2)] : 
    subscript out of bounds
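
One possible workaround for the integer case, offered only as a sketch and
not something proposed in this thread: convert the grouping values to
character so that the index matrix is matched against the dimnames of the
tapply() result instead of being treated as positions. The names m, idx
and scaled are made up for the illustration.

```r
# Integer grouping columns, as in the 'di' example above
di <- data.frame(x  = 1:10,
                 g1 = rep(11:12, each = 5),
                 g2 = rep(21:23, c(3, 3, 4)))

# Group maxima: a 2 x 3 matrix with dimnames "11","12" and "21","22","23"
m <- with(di, tapply(x, list(g1, g2), FUN = max))

# A character index matrix is matched against dimnames, not positions
idx <- with(di, cbind(as.character(g1), as.character(g2)))

scaled <- di$x / m[idx]
scaled
```

This relies on the dimnames that tapply() builds from the group values, so
it should behave the same whether the grouping columns are integers,
character vectors or factors.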

To avoid that problem you can call tapply with no FUN
to get the indices to subscript by
  > with(d, x/tapply(x, list(g1,g2), FUN=max)[tapply(x, list(g1, g2))])
   [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
   [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
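
To see why this works, it may help to look at what tapply() returns when
FUN is omitted. The sketch below rebuilds d with explicit factors (an
assumption, so that it does not depend on the stringsAsFactors default) and
stores the pieces in the made-up names cell and maxes.

```r
d <- data.frame(x  = 1:10,
                g1 = factor(LETTERS[rep(11:12, each = 5)]),
                g2 = factor(letters[rep(21:23, c(3, 3, 4))]))

# Without FUN, tapply() returns one number per row: the position of that
# row's (g1, g2) cell in the array that tapply(..., FUN = ) would return
cell <- with(d, tapply(x, list(g1, g2)))
cell
# 1 1 1 3 3 4 6 6 6 6

# The same positions can therefore subscript the matrix of group maxima
maxes  <- with(d, tapply(x, list(g1, g2), FUN = max))
scaled <- d$x / maxes[cell]
```

Positions 2 and 5 never appear because the L.u and K.w cells are empty.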

The misleadingly named ave() can avoid the need to do the
subscripting after tapply but has other problems
  > with(d, x/ave(x, g1, g2, FUN=max))
   [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
   [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
  Warning messages:
  1: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning -Inf
  2: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning -Inf
It gives the right answer but it is calling FUN even for
the empty interaction groups.  For some FUN's this would
abort the call, not just give a warning.   In any case it
is a waste of time.

In either case you can also use the interaction() function to
change the multiple grouping vectors into one:
  > d <- within(d, g1g2 <- interaction(g1, g2, drop=TRUE))
  > with(d, x/ave(x, g1g2, FUN=max))
   [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
   [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
  > with(d, x/tapply(x, g1g2, FUN=max)[g1g2])
        K.u       K.u       K.u       K.v       K.v       L.v 
  0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000 
        L.w       L.w       L.w       L.w 
  0.7000000 0.8000000 0.9000000 1.0000000
  > with(d, x/tapply(x, g1g2, FUN=max)[tapply(x, g1g2)])
        K.u       K.u       K.u       K.v       K.v       L.v 
  0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000 
        L.w       L.w       L.w       L.w 
  0.7000000 0.8000000 0.9000000 1.0000000 
The names are probably unwanted in the tapply cases; use unname
to get rid of them.
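
A minimal sketch of the unname() step, again rebuilding d with explicit
factors so that the block stands alone (the name scaled is only for
illustration):

```r
d <- data.frame(x  = 1:10,
                g1 = factor(LETTERS[rep(11:12, each = 5)]),
                g2 = factor(letters[rep(21:23, c(3, 3, 4))]))

# One combined grouping factor; drop = TRUE discards the empty combinations
d <- within(d, g1g2 <- interaction(g1, g2, drop = TRUE))

# Same computation as above, with the group labels stripped from the result
scaled <- with(d, unname(x / tapply(x, g1g2, FUN = max)[g1g2]))
scaled
```

The result may still carry a one-dimensional dim attribute inherited from
the tapply() output; as.vector() would remove that as well if a plain
vector is needed.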

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> and be done with it?
> 
> (Warning: if you replace the second '<-' above by '=', it will not work.
> It is NOT true that you can always replace '<-' by '=' for assignment.
> Why?)
> 
> Bill Venables.
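
On the closing question, a hedged illustration using a small made-up data
frame df: directly inside a function call, 'name = value' is parsed as a
named argument rather than as an assignment, so within() never sees an
expression to evaluate. Wrapping the expression in braces makes '=' an
assignment again.

```r
df <- data.frame(x = 1:3)

# '<-' is an assignment wherever it appears, so this adds a column y
a <- within(df, y <- x * 2)

# Inside braces, '=' is parsed as an assignment too
b <- within(df, { y = x * 2 })

# Without braces, 'y = x * 2' would be passed as an argument named y:
#   within(df, y = x * 2)

identical(a, b)
```

The brace form is also the usual way to create several new columns in a
single within() call.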