[R] SUM,COUNT,AVG

hadley wickham h.wickham at gmail.com
Mon Apr 6 16:56:05 CEST 2009


On Mon, Apr 6, 2009 at 9:34 AM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
> There are various ways to do this in R.
>
> # sample data
> dd <- data.frame(a=1:10,b=sample(3,10,replace=T),c=sample(3,10,replace=T))
>
> Using the standard built-in functions, you can use:
>
> *** aggregate ***
>
> aggregate(dd,list(b=dd$b,c=dd$c),sum)
>  b c  a b c
> 1 1 1 10 2 2
> 2 2 1  3 2 1
> ....
>
> *** tapply ***
>
> tapply(dd$a,interaction(dd$b,dd$c),sum)
>      1.1       2.1       3.1       1.2       2.2       3.2       1.3       2.3
>  5.000000  3.000000 10.000000  5.000000        NA        NA  5.000000
> ...
>
> But the nicest way is probably to use the plyr package:
>
>> library(plyr)
>> ddply(dd,~b+c,sum)
>  b c V1
> 1 1 1 14
> 2 2 1  6
> ....
>
> ********
>
> Unfortunately, none of these approaches allows you to return more than one
> result from the function, so you'll need to write
>
>> ddply(dd,~b+c,length)   # count
>> ddply(dd,~b+c,sum)
>> ddply(dd,~b+c,mean)   # arithmetic average
>
> There is an 'each' function in plyr, but it doesn't seem to be compatible
> with ddply.

That's because ddply applies the function to the whole data frame, not
just the columns that aren't participating in the split.  One way
around it is:

ddply(dd, ~ b + c, function(df) each(length, sum, mean)(df$a))
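
Spelled out as a self-contained sketch (the seed and the name `res` are
assumptions added here so the run is repeatable; the sample data is the
same shape as in the quoted message):

```r
library(plyr)

set.seed(1)  # assumed seed, only so the example is reproducible
dd <- data.frame(a = 1:10,
                 b = sample(3, 10, replace = TRUE),
                 c = sample(3, 10, replace = TRUE))

# Split dd by b and c, then compute count, sum and mean of column a
# for each piece. each() bundles the three functions into one function
# that returns a named vector, which ddply turns into three columns.
res <- ddply(dd, ~ b + c, function(df) each(length, sum, mean)(df$a))
print(res)
```

The result should have one row per b/c combination present in the data,
with columns named after the functions passed to each().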

I haven't figured out a more elegant way to specify this yet.
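
Why the wrapper is needed can be checked in base R alone, without plyr.
A small sketch, where `piece` is a made-up data frame standing in for
what the function receives for one b/c group:

```r
# Hypothetical piece: all rows of dd that share one value of b and c
piece <- data.frame(a = c(2, 5, 7, 9), b = 1, c = 2)

n_cols  <- length(piece)    # number of columns, not number of rows
n_rows  <- length(piece$a)  # the count that was actually wanted
sum_all <- sum(piece)       # a, b and c all added together
sum_a   <- sum(piece$a)     # the sum of a only

print(c(n_cols = n_cols, n_rows = n_rows, sum_all = sum_all, sum_a = sum_a))
```

Applied to a whole data frame, length counts columns and sum adds every
column, grouping variables included, which is why the function has to
pick out df$a itself.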

Hadley

-- 
http://had.co.nz/


