[R] Group by a data frame with multiple columns
arun
smartpink111 at yahoo.com
Sun Aug 4 06:07:32 CEST 2013
Hi,
Maybe you should try ?data.table().
Please use ?dput() to post example data.
dat1<- read.table(text="
Area Sex Year y
Bob F 2011 1
Bob F 2011 2
Bob F 2012 3
Bob M 2012 3
Bob M 2012 2
Fred F 2011 1
Fred F 2011 1
Fred F 2012 2
Fred M 2012 3
Fred M 2012 1
",sep="",header=TRUE,stringsAsFactors=FALSE)
library(data.table)
dt1<- data.table(dat1)  # convert the data frame first; dt1 was not defined otherwise
dt2<-dt1[,sum(y),by=list(Area,Sex,Year)]
dt2
#   Area Sex Year V1
#1:  Bob   F 2011  3
#2:  Bob   F 2012  3
#3:  Bob   M 2012  5
#4: Fred   F 2011  2
#5: Fred   F 2012  2
#6: Fred   M 2012  4
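If you would rather have the summed column called y than the default V1, one option is to name it inside j. This is only a sketch: the toy data are rebuilt inline (as `dt`, a name chosen here) so that it runs on its own.

```r
library(data.table)

# Same ten rows as the example data, rebuilt so the snippet is self-contained
dt <- data.table(
  Area = rep(c("Bob", "Fred"), each = 5),
  Sex  = rep(c("F", "F", "F", "M", "M"), times = 2),
  Year = rep(c(2011L, 2011L, 2012L, 2012L, 2012L), times = 2),
  y    = c(1L, 2L, 3L, 3L, 2L, 1L, 1L, 2L, 3L, 1L)
)

# Naming the list element in j names the result column ("y" instead of "V1")
res <- dt[, list(y = sum(y)), by = list(Area, Sex, Year)]
res
```

With by=, the groups come back in order of first appearance in the data.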
#Speed
set.seed(28)
dat2<- data.frame(Area=sample(LETTERS,1e7,replace=TRUE),Sex=sample(c("F","M"),1e7,replace=TRUE),Year=sample(2005:2012,1e7,replace=TRUE),y=sample(1:10,1e7,replace=TRUE))
system.time(datTest<- aggregate(y~.,data=dat2,sum))
#   user  system elapsed
# 18.056   1.336  19.424
datTest2<- datTest[order(datTest$Area,datTest$Sex,datTest$Year),]
row.names(datTest2)<- 1:nrow(datTest2)
dtTest<- data.table(dat2)
system.time({
setkey(dtTest,Area,Sex,Year)
dtTest2<- dtTest[,sum(y),by=list(Area,Sex,Year)]})
#   user  system elapsed
#  1.232   0.184   1.418
setnames(dtTest2,"V1","y")
identical(datTest2,as.data.frame(dtTest2))
#[1] TRUE
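A possible variant, not the one timed above: keyby= instead of by= groups and returns the result sorted by the grouping columns, so the separate setkey() call is not needed. The sketch below uses a smaller sample (1e5 rows, an arbitrary choice so it runs quickly) and compares only the summed values against aggregate(), since row names and other attributes can differ between the two results.

```r
library(data.table)

set.seed(28)
n <- 1e5  # assumption: smaller than the 1e7 rows above, just to keep it quick
dat <- data.frame(
  Area = sample(LETTERS, n, replace = TRUE),
  Sex  = sample(c("F", "M"), n, replace = TRUE),
  Year = sample(2005:2012, n, replace = TRUE),
  y    = sample(1:10, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Base R reference, sorted by the grouping columns
agg <- aggregate(y ~ ., data = dat, FUN = sum)
agg <- agg[order(agg$Area, agg$Sex, agg$Year), ]

# keyby= groups and sorts the result by Area, Sex, Year in one step
dtAgg <- as.data.table(dat)[, list(y = sum(y)), keyby = list(Area, Sex, Year)]

# Compare the summed values only
all.equal(agg$y, dtAgg$y)
```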
A.K.
----- Original Message -----
From: Michael Liaw <michael.liaw at hotmail.com>
To: r-help at r-project.org
Cc:
Sent: Saturday, August 3, 2013 8:11 PM
Subject: [R] Group by a data frame with multiple columns
Hi
I'm trying to manipulate a data frame (that has about 10 million rows)
by "grouping" it by multiple columns. For example, say the data set looks
like:
Area  Sex  Year  y
Bob   F    2011  1
Bob   F    2011  2
Bob   F    2012  3
Bob   M    2012  3
Bob   M    2012  2
Fred  F    2011  1
Fred  F    2011  1
Fred  F    2012  2
Fred  M    2012  3
Fred  M    2012  1
And I want it to look like
Area  Sex  Year  Sum of y
Bob   F    2011  3
Bob   F    2012  3
Bob   M    2012  5
Fred  F    2011  2
Fred  F    2012  2
Fred  M    2012  4
I think I can use something like:
tmp <- aggregate(y ~ ., data = dat, FUN = sum)  # dat: placeholder name for the data frame
But due to the size it's really taking a strain on the computer (even with
64-bit R on a, yes unfortunately Windows, machine with 16GB RAM :(). The
reason for me wanting the data set to get into this form is I want to then
apply the population information and get the "rate" on the "sum of y" column
then fit a Poisson regression model.
I'm wondering (and would appreciate comments) whether there is a more
efficient way to do the process I described?
Cheers
Michael