[R] Doing operations by grouping variable

ONKELINX, Thierry Thierry.ONKELINX at inbo.be
Thu Sep 23 11:40:44 CEST 2010


Another option for doing operations by grouping variables is to use the
plyr package.

d <- data.frame(x=1:10,
		g1=LETTERS[rep(11:12,each=5)],
		g2=letters[rep(21:23,c(3,3,4))]
)
library(plyr)
ddply(d, c("g1", "g2"), function(z){
	z$x <- z$x / max(z$x)
	z
})
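
For comparison, roughly the same split-apply-combine steps can be sketched
in base R. This is only an illustration of the idea, not how ddply() is
implemented, and the row names and row order of the result may differ from
the plyr output. The names `pieces` and `res` are made up for this example.

```r
d <- data.frame(x = 1:10,
                g1 = LETTERS[rep(11:12, each = 5)],
                g2 = letters[rep(21:23, c(3, 3, 4))])

# split the rows by every g1/g2 combination that occurs in the data
pieces <- split(d, list(d$g1, d$g2), drop = TRUE)

# rescale x by its maximum within each piece
pieces <- lapply(pieces, function(z) {
  z$x <- z$x / max(z$x)
  z
})

# glue the pieces back together into one data frame
res <- do.call(rbind, pieces)
rownames(res) <- NULL
```

With drop = TRUE the empty g1/g2 combinations are never passed to the
function, so max() is not called on zero-length input.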


----------------------------------------------------------------------------
ir. Thierry Onkelinx
Research Institute for Nature and Forest
team Biometrics & Quality Assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium

tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be
www.inbo.be

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey
  

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
> Sent: Wednesday, September 22, 2010 19:51
> To: Seth W Bigelow; Bill.Venables at csiro.au
> CC: r-help at r-project.org
> Subject: Re: [R] Doing operations by grouping variable
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Seth W Bigelow
> > Sent: Tuesday, September 21, 2010 4:22 PM
> > To: Bill.Venables at csiro.au
> > Cc: r-help at r-project.org
> > Subject: Re: [R] Doing operations by grouping variable
> > 
> > Aah, that is the sort of truly elegant solution I have been seeking.
> > And it's wrapped up in a nice programming shortcut to boot (i.e., the
> > within statement). I retract anything I may have said about tapply
> > being clunky.
> > 
> > Many thanks
> > 
> > --Seth
> > 
> > Dr. Seth  W. Bigelow
> > Biologist, USDA-FS Pacific Southwest Research Station
> > 1731 Research Park Drive, Davis California
> > 
> > <Bill.Venables at csiro.au>
> > 09/21/2010 03:15 PM
> > 
> > To <sbigelow at fs.fed.us>
> > 
> > You left out the subscript.  Why not just do
> > 
> > d <- within(data.frame(group = rep(1:5, each = 5), 
> >                        variable = rnorm(25)), 
> >             scaled <- variable/tapply(variable, group, max)[group])
> 
> This approach can be tricky when there is more than one 
> grouping variable.  E.g., suppose we have grouping variables
> g1 and g2:
>   > d <- data.frame(x=1:10,
>                   g1=LETTERS[rep(11:12,each=5)],
>                   g2=letters[rep(21:23,c(3,3,4))])
>   > d
>       x g1 g2
>   1   1  K  u
>   2   2  K  u
>   3   3  K  u
>   4   4  K  v
>   5   5  K  v
>   6   6  L  v
>   7   7  L  w
>   8   8  L  w
>   9   9  L  w
>   10 10  L  w
> and we want to divide each x value by its max for each
> g1*g2 group (6 possible groups, of which 4 are in the data).
> 
> You can extend Bill V.'s approach with
>   > with(d, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
>    [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
>    [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
> That would fail if g1 and g2 were not factors but were integer
> vectors.  Try it with
>   > di <- data.frame(x=1:10,
>                   g1=rep(11:12,each=5),
>                   g2=rep(21:23,c(3,3,4)))
>   > with(di, x/tapply(x, list(g1,g2), FUN=max)[cbind(g1,g2)])
>   Error in tapply(x, list(g1, g2), FUN = max)[cbind(g1, g2)] : 
>     subscript out of bounds
> 
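The failure comes from cbind(g1, g2) being an integer matrix, so a row
such as (11, 21) is read as a position in the 2 x 3 table. One possible
workaround, sketched here on the assumption that your version of R
supports indexing an array by a character matrix (it is matched against
the dimnames), is to convert the grouping values to character first. The
names `m`, `idx` and `scaled` are only for illustration.

```r
di <- data.frame(x = 1:10,
                 g1 = rep(11:12, each = 5),
                 g2 = rep(21:23, c(3, 3, 4)))

# 2 x 3 table of group maxima, with dimnames "11","12" and "21","22","23"
m <- with(di, tapply(x, list(g1, g2), FUN = max))

# a character matrix index is matched against the dimnames,
# not used as row/column positions
idx <- with(di, cbind(as.character(g1), as.character(g2)))
scaled <- di$x / m[idx]
```
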
> To avoid that problem you can call tapply with no FUN to get 
> the indices to subscript by
>   > with(d, x/tapply(x, list(g1,g2), FUN=max)[tapply(x, list(g1, g2))])
>    [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
>    [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
> 
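To see what the FUN-less call returns, here is a small sketch using the
integer-valued data frame from above: each row gets the linear index of
its cell in the table that tapply() would normally produce, so it can be
used directly as a subscript. The names `grp`, `grp_max` and `scaled` are
made up for this example.

```r
di <- data.frame(x = 1:10,
                 g1 = rep(11:12, each = 5),
                 g2 = rep(21:23, c(3, 3, 4)))

# no FUN: one cell index per row, counted column-wise through the 2 x 3 table
grp <- with(di, tapply(x, list(g1, g2)))

# the table of group maxima, then one maximum per row via the cell index
grp_max <- with(di, tapply(x, list(g1, g2), FUN = max))
scaled <- di$x / grp_max[grp]
```

This works the same way whether the grouping variables are factors,
character vectors or integers.
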
> The misleadingly named ave() can avoid the need to do the 
> subscripting after tapply but has other problems
>   > with(d, x/ave(x, g1, g2, FUN=max))
>    [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
>    [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
>   Warning messages:
>   1: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning -Inf
>   2: In FUN(X[[6L]], ...) : no non-missing arguments to max; returning -Inf
> It gives the right answer but it is calling FUN even for the empty
> interaction groups.  For some FUN's this would abort the call, not
> just give a warning.  In any case it is a waste of time.
> 
> In either case you can also use the interaction() function to 
> change the multiple grouping vectors into one:
>   > d <- within(d, g1g2 <- interaction(g1, g2, drop=TRUE))
>   > with(d, x/ave(x, g1g2, FUN=max))
>    [1] 0.3333333 0.6666667 1.0000000 0.8000000 1.0000000
>    [6] 1.0000000 0.7000000 0.8000000 0.9000000 1.0000000
>   > with(d, x/tapply(x, g1g2, FUN=max)[g1g2])
>         K.u       K.u       K.u       K.v       K.v       L.v 
>   0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000 
>         L.w       L.w       L.w       L.w 
>   0.7000000 0.8000000 0.9000000 1.0000000
>   > with(d, x/tapply(x, g1g2, FUN=max)[tapply(x, g1g2)])
>         K.u       K.u       K.u       K.v       K.v       L.v 
>   0.3333333 0.6666667 1.0000000 0.8000000 1.0000000 1.0000000 
>         L.w       L.w       L.w       L.w 
>   0.7000000 0.8000000 0.9000000 1.0000000
> The names are probably unwanted in the tapply cases; use unname to
> get rid of them.
> 
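A short sketch of that last remark, assuming the combined factor is
stored in a column called g1g2 as in the quoted code (`scaled` is just
an illustrative name):

```r
d <- data.frame(x = 1:10,
                g1 = LETTERS[rep(11:12, each = 5)],
                g2 = letters[rep(21:23, c(3, 3, 4))])

# one grouping factor holding only the g1/g2 combinations present in the data
d$g1g2 <- interaction(d$g1, d$g2, drop = TRUE)

# same computation as in the quoted message, with the group names stripped
scaled <- with(d, unname(x / tapply(x, g1g2, FUN = max)[g1g2]))
```
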
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com  
> 
> > and be done with it?
> > 
> > (Warning: if you replace the second '<-' above by '=', it will not
> > work.  It is NOT true that you can always replace '<-' by '=' for
> > assignment.  Why?)
> > 
> > Bill Venables.
> 
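As far as I understand it, the answer to the "Why?" is that inside a
function call `name = value` is read as a named argument, not as an
assignment, so within() never receives an expression to evaluate. A small
sketch with made-up data (the names `df`, `ok`, `bad` and `ok2` are only
for illustration):

```r
df <- data.frame(group = rep(1:2, each = 3),
                 variable = c(1, 2, 4, 3, 6, 12))

# '<-': the second argument is an expression, evaluated inside the data frame
ok <- within(df, scaled <- variable / tapply(variable, group, max)[group])

# '=': parsed as an argument named 'scaled', so the expression is missing;
# the error is caught here so that the script keeps running
bad <- tryCatch(
  within(df, scaled = variable / tapply(variable, group, max)[group]),
  error = function(e) e
)

# inside braces '=' is an assignment again
ok2 <- within(df, { scaled = variable / tapply(variable, group, max)[group] })
```
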