[Rd] combining large list of data.frames
Cole Beck
cole.beck at Vanderbilt.Edu
Fri Apr 20 00:34:29 CEST 2012
I routinely create a list of data.frames and then use
do.call('rbind', list(...)) to combine them into a single data.frame.
However, I've noticed that when the list grows large, it can be much
faster to do this in chunks. As an example, here's a list of 20,000
similar data.frames.
# create list of data.frames
dat <- vector("list", 20000)
for(i in seq_along(dat)) {
  size <- sample(1:30, 1)
  dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size),
    letter=sample(LETTERS, size, replace=TRUE),
    ind=sample(c(TRUE,FALSE), size, replace=TRUE))
}
# combine into one data.frame, normal usage
# system.time(do.call('rbind', dat)) # takes 2-3 minutes
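# combine a list of data.frames in chunks
# x: list of data.frames; steps: requested number of chunks
combine <- function(x, steps=NA, verbose=FALSE) {
  nr <- length(x)
  if(nr == 0) return(NULL)
  if(is.na(steps)) steps <- nr
  # cap steps at nr, otherwise the loop below never terminates
  steps <- min(steps, nr)
  # increase steps until it divides nr evenly
  while(nr %% steps != 0) steps <- steps+1
  if(verbose) cat(sprintf("step size: %s\r\n", steps))
  dl <- vector("list", steps)
  for(i in seq_len(steps)) {
    # indices of the i-th chunk of nr/steps elements
    ix <- seq(from=(i-1)*nr/steps+1, length.out=nr/steps)
    dl[[i]] <- do.call("rbind", x[ix])
  }
  # combine the per-chunk results
  do.call("rbind", dl)
}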
# combine into one data.frame
system.time(combine(dat, 100)) # takes 5-10 seconds
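As a sanity check, chunking should not change the combined result. The sketch below is a minimal, self-contained illustration on a much smaller list; the names rbind_chunked and small are hypothetical and not part of the code above, and the chunking is done with split() rather than the index arithmetic used in combine().

```r
# Hypothetical helper (not from the post): rbind a list of data.frames
# in roughly `chunks` contiguous groups, then rbind the group results.
rbind_chunked <- function(x, chunks = 10) {
  if (length(x) == 0) return(NULL)
  chunk_size <- ceiling(length(x) / chunks)
  grp <- ceiling(seq_along(x) / chunk_size)
  parts <- lapply(split(x, grp), function(p) do.call("rbind", p))
  # Drop the names added by split() so they are not used as row-name prefixes.
  do.call("rbind", unname(parts))
}

set.seed(1)
small <- lapply(1:200, function(i) {
  size <- sample(1:5, 1)
  data.frame(id = rep(i, size), value = rnorm(size))
})

direct  <- do.call("rbind", small)
chunked <- rbind_chunked(small, chunks = 10)

# Compare the two results, including row order and row names.
isTRUE(all.equal(direct, chunked))
```

The unname() call matters here: a named list passed to rbind would have its names folded into the row names, so the chunked result would no longer match the direct one.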
I'm very surprised by this result. Does this improvement seem
reasonable? I would think "do.call" could do something similar by
default when "args" is very long. Is using "do.call" not
recommended in this scenario?
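One way to probe the question is to time the direct call at a few list lengths and look at how the elapsed time per element changes. This is only a sketch under assumptions: make_list and the chosen sizes are hypothetical, the lists are far smaller than the 20,000 above, and the timings will vary by machine and R version.

```r
# Hypothetical helper: build n small data.frames similar to those in the post.
make_list <- function(n) {
  lapply(seq_len(n), function(i) {
    size <- sample(1:30, 1)
    data.frame(id = rep(i, size), value = rnorm(size))
  })
}

set.seed(42)
sizes <- c(500, 1000, 2000)

# Elapsed seconds for a single direct rbind at each list length.
elapsed <- sapply(sizes, function(n) {
  x <- make_list(n)
  system.time(do.call("rbind", x))[["elapsed"]]
})

# If per_element rises with n, the cost is growing faster than linearly.
scaling <- data.frame(n = sizes, elapsed = elapsed,
                      per_element = elapsed / sizes)
print(scaling)
```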
Regards,
Cole Beck