[Rd] combining large list of data.frames

Cole Beck cole.beck at Vanderbilt.Edu
Fri Apr 20 00:34:29 CEST 2012


It's normal for me to create a list of data.frames and then use 
do.call('rbind', dat) to collapse them into a single data.frame.  However, 
I've noticed that as the list grows large, it can be much faster 
to do this in chunks.  As an example, here's a list of 20,000 similar 
data.frames.

# create a list of 20,000 data.frames of random size
dat <- vector("list", 20000)
for (i in seq_along(dat)) {
  size <- sample(1:30, 1)
  dat[[i]] <- data.frame(id = rep(i, size),
                         value = rnorm(size),
                         letter = sample(LETTERS, size, replace = TRUE),
                         ind = sample(c(TRUE, FALSE), size, replace = TRUE))
}
# combine into one data.frame, normal usage
# system.time(do.call('rbind', dat)) # takes 2-3 minutes
combine <- function(x, steps=NA, verbose=FALSE) {
  nr <- length(x)
  if (is.na(steps)) steps <- nr
  # bump steps up to the next divisor of nr so the chunks are equal-sized
  while (nr %% steps != 0) steps <- steps + 1
  if (verbose) cat(sprintf("number of chunks: %s\n", steps))
  dl <- vector("list", steps)
  for (i in seq_len(steps)) {
    # indices of the i-th chunk of nr/steps list elements
    ix <- seq(from = (i - 1) * nr / steps + 1, length.out = nr / steps)
    dl[[i]] <- do.call("rbind", x[ix])
  }
  do.call("rbind", dl)
}
# combine into one data.frame
system.time(combine(dat, 100)) # takes 5-10 seconds
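
The same chunking can also be written with split(), which avoids the 
explicit index arithmetic.  This is only a sketch: the name 
chunked_rbind and the default chunk size of 200 are my own choices, not 
anything measured above.

```r
# Chunked rbind via split(): label each list element with a chunk
# number, rbind within each chunk, then rbind the (short) list of
# chunk results.  unname() avoids split()'s names leaking into the
# row names of the final rbind.
chunked_rbind <- function(x, chunk = 200) {
  groups <- ceiling(seq_along(x) / chunk)   # 1,1,...,1,2,2,...,2,3,...
  pieces <- lapply(unname(split(x, groups)),
                   function(xs) do.call("rbind", xs))
  do.call("rbind", pieces)
}
```

As far as I can tell this produces the same data.frame as a single 
do.call("rbind", x), while keeping every individual rbind call short.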

I'm very surprised by this result.  Does this improvement seem 
reasonable?  I would think "do.call" could use a similar strategy by 
default when "args" is very long.  Is using "do.call" not 
recommended in this scenario?
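
For what it's worth, one shape such a default could take is a 
divide-and-conquer fold, which needs no tuning parameter at all.  A 
sketch (combine_rec is my own name, not an existing function):

```r
# Recursive halving: combine each half of the list separately, then
# rbind the two results, so no single rbind call ever receives a long
# argument list.  Recursion depth is only about log2(length(x)).
combine_rec <- function(x) {
  n <- length(x)
  if (n == 1L) return(x[[1L]])
  mid <- n %/% 2L
  rbind(combine_rec(x[seq_len(mid)]), combine_rec(x[(mid + 1L):n]))
}
```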

Regards,
Cole Beck
