[R] aggregate with cumsum
Bond, Stephen
Stephen.Bond at cibc.com
Mon Oct 18 15:55:40 CEST 2010
Gabor,
You are suggesting some very advanced usage that I do not understand, but it seems this is not what I meant when I said loop.
I have a data frame with 47k rows, and each row is fed to 'predict', which outputs about 62 rows, so the number of groups is very large. By "loop" I meant going through the 47k x 62 rows with
for (jj in (set of 47k values)) # tmp.df=big.df[big.df$group==jj,] to subset
# and then sum
which is very slow. I discovered that even creating the dataset is super slow, because I use write.table.
The clogging comes from
write.table(tmp,"predcom.csv",row.names=FALSE,col.names=FALSE,append=TRUE,sep=',')
Can anybody suggest a faster way of appending to a text file?
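One option that may help, sketched here under assumptions rather than measured on the real data: open the output file once as a connection and pass that connection to write.table, so the file is not reopened on every call. The helper make_chunk and the temporary file names are hypothetical stand-ins for the real 'predict' output and for "predcom.csv".

```r
# Hypothetical stand-in for one chunk of 'predict' output (about 62 rows).
make_chunk <- function(jj) {
  data.frame(group = jj, step = 1:62, val = jj + (1:62) / 100)
}

out_file <- tempfile(fileext = ".csv")  # stand-in for "predcom.csv"

# Open the connection once. Each write.table() call then continues where
# the previous one stopped, so append = TRUE is not needed.
con <- file(out_file, open = "w")
for (jj in 1:100) {
  tmp <- make_chunk(jj)
  write.table(tmp, con, row.names = FALSE, col.names = FALSE, sep = ",")
}
close(con)

# Alternative: keep the chunks in a list and write everything in one call.
chunks <- lapply(1:100, make_chunk)
all_rows <- do.call(rbind, chunks)
out_file2 <- tempfile(fileext = ".csv")
write.table(all_rows, out_file2, row.names = FALSE, col.names = FALSE, sep = ",")
```

Both variants produce the same file contents. The second keeps all rows in memory, which may or may not be acceptable for 47k x 62 rows.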
All comments are appreciated.
Stephen B
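The subset-and-sum loop described in the message above can probably be replaced by a single grouped call. The following is a minimal sketch, assuming the stacked output has a grouping column named group (as in the pseudocode) and a numeric column to be summed, here given the hypothetical name val.

```r
set.seed(1)
# Hypothetical stand-in for the stacked 'predict' output.
big.df <- data.frame(group = rep(1:1000, each = 62),
                     val   = rnorm(1000 * 62))

# One pass over the data: rowsum() returns a one-column matrix
# with one row per group.
sums <- rowsum(big.df$val, big.df$group)

# The same totals via tapply(), returned as a named vector.
sums2 <- tapply(big.df$val, big.df$group, sum)

# Loop-style version on a few groups only, for comparison.
loop_sums <- sapply(1:5, function(jj) sum(big.df$val[big.df$group == jj]))
```

The loop scans the whole data frame once per group, while rowsum and tapply handle all groups in one call.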
-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Tuesday, October 12, 2010 4:16 PM
To: Bond, Stephen
Cc: r-help at r-project.org
Subject: Re: [R] aggregate with cumsum
On Tue, Oct 12, 2010 at 1:40 PM, Bond, Stephen <Stephen.Bond at cibc.com> wrote:
> Hello everybody,
>
> Data is
> myd <- data.frame(id1=rep(c("a","b","c"),each=3),id2=rep(1:3,3),val=rnorm(9))
>
> I want to get a cumulative sum within each level of id1. Trying aggregate does not work:
>
> myd$pcum <- aggregate(myd[,c("val")],list(orig=myd$id1),cumsum)
>
> Please suggest a solution. In reality the data frame is huge, so looping with for and subsetting is not a great idea (still doable, though).
Looping can be slow, but it's not necessarily so. Here are three
approaches to using ave with cumsum to solve this problem. The
benchmark shows that the loop is actually the fastest, though only
slightly ahead of nonloop2:
N <- 1e4
k <- 10
myd <- data.frame(id1 = rep(letters[1:k], each = N),
                  id2 = rep(1:k, N),
                  val = rnorm(k * N))

library(rbenchmark)
benchmark(order = "relative", replications = 100,
    loop = {
        loop <- myd
        for(i in 2:3) loop[, i] <- ave(myd[, i], myd[, 1], FUN = cumsum)
    },
    nonloop1 = {
        nonloop1 <- transform(myd,
            id2 = ave(id2, id1, FUN = cumsum),
            val = ave(val, id1, FUN = cumsum)
        )
    },
    nonloop2 = {
        f <- function(i) ave(myd[, i], myd[, 1], FUN = cumsum)
        nonloop2 <- replace(myd, 2:3, lapply(2:3, f))
    }
)

identical(loop, nonloop1)
identical(loop, nonloop2)
The output on my laptop is:
      test replications elapsed relative user.self sys.self user.child sys.child
1     loop          100    8.52 1.000000      8.07     0.10         NA        NA
3 nonloop2          100    8.94 1.049296      8.29     0.17         NA        NA
2 nonloop1          100   11.65 1.367371     10.71     0.22         NA        NA
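For the small data frame in the original question, the same idea reduces to one line. This is a minimal sketch; set.seed is added only so the example is reproducible, and pcum is the column name used in the question.

```r
set.seed(42)  # only so the example is reproducible
myd <- data.frame(id1 = rep(c("a", "b", "c"), each = 3),
                  id2 = rep(1:3, 3),
                  val = rnorm(9))

# ave() applies FUN within each level of id1 and returns a vector
# aligned with the original rows, so it can be assigned as a column.
myd$pcum <- ave(myd$val, myd$id1, FUN = cumsum)
```

aggregate returns one row per group, whereas ave returns a value for every row, which is what a per-group running total needs.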
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com