[R-SIG-Mac] multicore package: collecting results

Wed Jun 29 21:48:36 CEST 2011

On Jun 29, 2011, at 2:59 PM, Mike Lawrence wrote:

> Is the slowdown happening while mclapply runs or while you're doing
> the rbind? If the latter, I wonder if the code below is more efficient
> than using rbind inside a loop:
> 
> my_df = do.call( rbind , my_list_from_mclapply )
> 
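
To make the two variants in question concrete, here is a minimal sketch. The toy list `chunks` is made up and only stands in for the mclapply result; the names `grown` and `bound` are likewise just for illustration.

```r
# Toy stand-in for the mclapply() result: a list of data frames
# that all have the same columns (an assumption for this sketch).
chunks <- lapply(1:50, function(i) data.frame(id = rep(i, 4), value = i + 1:4))

# Variant 1: grow the result inside a loop. Every rbind() call copies
# everything collected so far, so the cost grows with each iteration.
grown <- chunks[[1]]
for (k in seq_along(chunks)[-1]) grown <- rbind(grown, chunks[[k]])

# Variant 2: hand the whole list to rbind() in a single call.
bound <- do.call(rbind, chunks)

# Both variants should produce the same data frame.
same <- isTRUE(all.equal(grown, bound))
```

Wrapping each variant in system.time() on the real list should show where the time goes.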

Another potential issue is that data frames perform many sanity checks, largely because of row.names handling. If you don't use row.names *and* know in advance that the concatenation is benign *and* your data types are compatible, you can usually speed things up immensely by operating on plain lists instead and converting to a data frame only at the very end, by declaring that the resulting list conforms to the data.frame class. Again, this only works if you really know what you're doing, but the speedup can be very big (usually orders of magnitude). This is general advice, not specific to rbind. Whether it would work for you or not is easy to test - something like

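l = my_list_from_mclapply
# concatenate column i across all chunks, for every column of the first chunk
all = lapply(seq_along(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
names(all) = names(l[[1]])
# compact form of the automatic row names 1..n
attr(all, "row.names") = c(NA_integer_, -length(all[[1]]))
# declare the list to be a data frame; no checks are performed
class(all) = "data.frame"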

Again, make sure all the assumptions above are satisfied before using.
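
One way to check those assumptions on a small sample is to compare the list-based result against a plain do.call(rbind, ...). The sketch below is only that: the helper name `fast_rbind` and the toy list `chunks` are made up for illustration, and it assumes every element has the same columns, in the same order, with compatible types and no meaningful row names.

```r
# Hypothetical helper (the name is an assumption): concatenate a list of
# data frames column by column, then declare the result a data frame.
fast_rbind <- function(l) {
  out <- lapply(seq_along(l[[1]]),
                function(i) do.call(c, lapply(l, function(x) x[[i]])))
  names(out) <- names(l[[1]])
  # compact form of the automatic row names 1..n
  attr(out, "row.names") <- c(NA_integer_, -length(out[[1]]))
  class(out) <- "data.frame"
  out
}

# Toy stand-in for the mclapply() result.
chunks <- lapply(1:50, function(i) data.frame(id = rep(i, 4), value = i + 1:4))

fast <- fast_rbind(chunks)
ref  <- do.call(rbind, chunks)

# Compare against the checked version before trusting the shortcut.
print(all.equal(fast, ref))
```

If the comparison reports differences on your own data (factor columns are a likely culprit, since c() does not necessarily preserve them), stay with do.call(rbind, ...).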

Cheers,
Simon

> 
> 
> On Wed, Jun 29, 2011 at 3:34 PM, Vincent Aubanel <v.aubanel at laslab.org> wrote:
>> Hi all,
>> 
>> I'm using mclapply() of the multicore package for processing chunks of data in parallel --and it works great.
>> 
>> But when I want to collect all processed elements of the returned list into one big data frame it takes ages.
>> 
>> The elements are all data frames with identical column names, and I'm using a simple rbind() inside a loop to do that. But I guess it performs some expensive checks at each iteration, as it gets slower and slower as it goes. Writing individual files out to disk, concatenating them with the system and reading the resulting file back from disk is actually faster...
>> 
>> Is there a magic argument to rbind() that I'm missing, or is there any other solution to collect the results of parallel processing efficiently?
>> 
>> Thanks,
>> Vincent
>> 
>> _______________________________________________
>> R-SIG-Mac mailing list
>> R-SIG-Mac at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>> 