[R] aggregate / collapse big data frame efficiently

arun smartpink111 at yahoo.com
Tue Dec 25 18:08:41 CET 2012


Hi,
You could use library(data.table):

library(data.table)

x <- data.frame(A = rep(letters, 2), B = rnorm(52), C = rnorm(52), D = rnorm(52))
res <- with(x, aggregate(cbind(B, C, D), by = list(A), mean))
colnames(res)[1] <- "A"

x1 <- data.table(x)
res2 <- x1[, list(B = mean(B), C = mean(C), D = mean(D)), by = A]
identical(res, data.frame(res2))
#[1] TRUE
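
Since your real data has roughly 100,000 value columns, spelling out each
column by name won't scale. A sketch using data.table's .SD (which holds all
non-grouping columns) should give the same result without naming them:

# mean of every non-grouping column, by group
res3 <- x1[, lapply(.SD, mean), by = A]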

Just for comparison:

set.seed(25)
xnew <- data.frame(A = rep(letters, 1500), B = rnorm(39000), C = rnorm(39000), D = rnorm(39000))
system.time(resnew <- with(xnew, aggregate(cbind(B, C, D), by = list(A), mean)))
#   user  system elapsed
#  0.152   0.000   0.152

xnew1 <- data.table(xnew)
system.time(resnew1 <- xnew1[, list(B = mean(B), C = mean(C), D = mean(D)), by = A])
#   user  system elapsed
#  0.004   0.000   0.005
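
Because your actual data is short and very wide (120 rows by ~100,000
columns), much of the cost is per-column overhead. A base-R sketch, assuming
all non-grouping columns are numeric: rowsum() computes the grouped column
sums on a matrix in one C-level call, and dividing by the group sizes gives
the group means.

m <- as.matrix(xnew[, -1])          # numeric columns only
sums <- rowsum(m, group = xnew$A)   # one row of column sums per group
counts <- as.vector(table(xnew$A))  # rows in each group (same sorted order)
means <- sums / counts              # group-wise column means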



A.K.




----- Original Message -----
From: Martin Batholdy <batholdy at googlemail.com>
To: "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Tuesday, December 25, 2012 11:34 AM
Subject: [R] aggregate / collapse big data frame efficiently

Hi,


I need to aggregate the rows of a data.frame, computing the mean over the rows that share a level of one factor variable;

here is the sample code:


x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))

aggregate(x, list(x[,1]), mean)


Now my problem is that the actual data set is much bigger (120 rows and approximately 100,000 columns), and it takes very, very long (at some point I just stopped it).

Is there anything that can be done to make the aggregate routine more efficient?
Or is there a different approach that would work faster?


Thanks for any suggestions!

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
