[R] How to do aggregate operations with non-scalar functions
Itay Furman
itayf at u.washington.edu
Thu Apr 7 07:18:34 CEST 2005
On Tue, 5 Apr 2005, Gabor Grothendieck wrote:
> On Apr 5, 2005 6:59 PM, Itay Furman <itayf at u.washington.edu> wrote:
>>
>> Hi,
>>
>> I have a data set, the structure of which is something like this:
>>
>>> a <- rep(c("a", "b"), c(6,6))
>>> x <- rep(c("x", "y", "z"), c(4,4,4))
>>> df <- data.frame(a=a, x=x, r=rnorm(12))
>>
>> The true data set has >1 million rows. The factors "a" and "x"
>> have about 70 levels each; combined together they subset 'df'
>> into ~900 data frames.
>> For each such subset I'd like to compute various statistics
>> including quantiles, but I can't find an efficient way of
[snip]
>> I would like to end up with a data frame like this:
>>
>>   a x         0%        25%
>> 1 a x -0.7727268  0.1693188
>> 2 a y -0.3410671  0.1566322
>> 3 b y -0.2914710 -0.2677410
>> 4 b z -0.8502875 -0.6505710
[snip]
> One can use
>
> do.call("rbind", by(df, list(a = a, x = x), f))
>
> where f is the appropriate function.
>
> In this case f can be described in terms of df.quantile which
> is like quantile except it returns a one row data frame:
>
> df.quantile <- function(x,p)
>     as.data.frame(t(data.matrix(quantile(x, p))))
>
> f <- function(df, p = c(0.25, 0.5))
>     cbind(df[1,1:2], df.quantile(df[,"r"], p))
>
Thanks! Just what I wanted.

A minor point is that for some reason the row numbers in the
final data frame are not sequential (see below -- this is not a
consequence of my changes).

Actually, seeing your code I became greedy and decided to
extract more summary statistics in one blow like this:

df.summary <- function(x, qtils=(0:4)/4)
    cbind(data.frame(mean=mean(x), var=var(x),
                     length=length(x)),
          as.data.frame(t(data.matrix(quantile(x, qtils)))))

f <- function(x, qtils=(0:4)/4)
    cbind(x[1,1:2], df.summary(x[,"r"], qtils))

> do.call("rbind", by(df, list(a = a, x = x), f))
  a x       mean         var length         0%        25%        50%
1 a x  0.2901207 0.522191469      4 -0.7727268  0.1693188  0.5523356
5 a y  0.6543314 1.981636402      2 -0.3410671  0.1566322  0.6543314
7 b y -0.2440109 0.004504928      2 -0.2914710 -0.2677410 -0.2440109
9 b z  0.4523763 1.841469995      4 -0.8502875 -0.6505710  0.4717093
         75%       100%
1  0.6731375  0.8285385
5  1.1520307  1.6497299
7 -0.2202808 -0.1965508
9  1.5746565  1.7163741
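
The row numbers 1, 5, 7, 9 in this output match the position of the first row of each subset in 'df', so they are most likely carried over by x[1,1:2] inside f. A minimal sketch of resetting them after the fact, assuming the example data and the functions from this message; the name 'res' and the seed are introduced here only for illustration:

```r
# Self-contained sketch: rebuild the example, then reset the row names.
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
a <- rep(c("a", "b"), c(6, 6))
x <- rep(c("x", "y", "z"), c(4, 4, 4))
df <- data.frame(a = a, x = x, r = rnorm(12))

df.summary <- function(x, qtils = (0:4)/4)
    cbind(data.frame(mean = mean(x), var = var(x), length = length(x)),
          as.data.frame(t(data.matrix(quantile(x, qtils)))))

f <- function(x, qtils = (0:4)/4)
    cbind(x[1, 1:2], df.summary(x[, "r"], qtils))

# 'res' is a hypothetical name for the combined result.
res <- do.call("rbind", by(df, list(a = a, x = x), f))

rownames(res) <- NULL  # replace inherited row names with 1, 2, 3, ...
```

Setting the row names to NULL makes R fall back to its default sequential numbering, which leaves the data columns untouched.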

What remains a puzzle to me is why R has a native subsetting
function that returns a scalar per subset [aggregate()], another
one that returns a list [by()], but no function that returns a
vector per subset. Is there less demand for such an operation
(like extracting summary statistics in one blow)? Is it less
general? Or technically more difficult to achieve?
I'm just curious.
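
One possible way to get a vector per subset with base functions is to split the values on both factors and apply a vector-valued function to each piece, letting sapply() bind the results into a matrix. This is only a sketch under the assumptions of the toy example above; the names 'groups', 'stats' and 'res' and the seed are introduced here for illustration, and it has not been checked against a data set with a million rows:

```r
# Rebuild the toy data from the original question.
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
a <- rep(c("a", "b"), c(6, 6))
x <- rep(c("x", "y", "z"), c(4, 4, 4))
df <- data.frame(a = a, x = x, r = rnorm(12))

# Split r by the a/x combinations; drop = TRUE omits empty combinations.
groups <- split(df$r, list(a = df$a, x = df$x), drop = TRUE)

# Each group yields a named numeric vector; sapply() collects them as columns.
stats <- sapply(groups, function(v)
    c(mean = mean(v), var = var(v), length = length(v),
      quantile(v, (0:4)/4)))

res <- t(stats)  # one row per subset, one column per statistic
```

The result is a numeric matrix whose row names encode the factor combination (for example "a.x"), so the grouping columns would have to be recovered from those names if a data frame like the one above is wanted.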
Itay
----------------------------------------------------------------
itayf at u.washington.edu / +1 (206) 543 9040 / U of Washington