[R-SIG-Mac] multicore package: collecting results

Vincent Aubanel v.aubanel at laslab.org
Thu Jun 30 13:28:16 CEST 2011


Thanks for this, it's now dead fast.
Simon's solution is astonishingly quick; however, I had to reconstruct the factors and their levels, which were (expectedly) lost during the c() operation. Unfortunately this eats up a fair amount of CPU, but on a 14-column, ~2-million-row data frame the whole thing is still twice as fast as the elegant one-line solution.
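
For the record, the reconstruction can be done in one pass over all factor columns, along these lines (just a sketch, assuming every chunk shares the same level sets; dl and all are the objects from the transcript below):

# Sketch: rebuild every factor column after the column-wise c()
# concatenation, reusing the level sets of the first chunk
# (assumes all chunks share the same levels). c() on factors
# returns the integer codes, so map them back to their labels
# before re-factoring.
fac_cols <- names(dl[[1]])[sapply(dl[[1]], is.factor)]
for (nm in fac_cols) {
  levs <- levels(dl[[1]][[nm]])
  all[[nm]] <- factor(levs[all[[nm]]], levels = levs)
}

If the chunks could have different level sets, the integer codes produced by c() are not comparable across chunks, and the columns would have to be converted to character before concatenating.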

Some performance figures:

> t <- proc.time()
> dl <- mclapply(lsessions, mcfun, mc.cores=cores)
> print(proc.time()-t)
   user  system elapsed 
171.894  47.696  28.713

> l <- dl
> all =  lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
> names(all) = names(l[[1]])
> #attr(all, "row.names") = seq.int(all[[1]])
> attr(all, "row.names") = c(NA, -length(all[[1]]))
> class(all) = "data.frame"
   user  system elapsed 
  0.412   0.280   0.708 

> all$factor <- factor(all$factor); levels(all$factor) <- c("A","B")
...
   user  system elapsed 
  4.852   2.349   7.038

> my_df = do.call(rbind, dl)
   user  system elapsed 
  9.791   5.411  15.039 
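
For anyone wanting to apply Simon's trick to their own data, a quick sanity check of its assumptions (identical column names and compatible column types across chunks) could look something like this (again just a sketch; dl is the list returned by mclapply):

# Sketch: verify the chunks are safe to concatenate column-wise.
# Each chunk must have the same column names, in the same order,
# and matching column classes.
same_names <- sapply(dl, function(x) identical(names(x), names(dl[[1]])))
same_types <- sapply(dl, function(x)
  identical(sapply(x, class), sapply(dl[[1]], class)))
stopifnot(same_names, same_types)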

Thanks to both of you!

Vincent


On Jun 29, 2011, at 9:48 PM, Simon Urbanek wrote:

> 
> On Jun 29, 2011, at 2:59 PM, Mike Lawrence wrote:
> 
>> Is the slowdown happening while mclapply runs or while you're doing
>> the rbind? If the latter, I wonder if the code below is more efficient
>> than using rbind inside a loop:
>> 
>> my_df = do.call( rbind , my_list_from_mclapply )
>> 
> 
> Another potential issue is that data frames perform many sanity checks, due to row.names handling etc. If you don't use row.names *and* know in advance that the concatenation is benign *and* your data types are compatible, you can usually speed things up immensely by operating on lists instead and converting to a data frame at the very end, by declaring that the resulting list conforms to the data.frame class. Again, this only works if you really know what you're doing, but the speed-up can be very large (usually orders of magnitude). This is general advice, not specific to rbind. Whether it would work for you or not is easy to test - something like
> 
> l = my_list_from_mclapply
> all =  lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
> names(all) = names(l[[1]])
> attr(all, "row.names") = c(NA, -length(all[[1]]))
> class(all) = "data.frame"
> 
> Again, make sure all the assumptions above are satisfied before using.
> 
> Cheers,
> Simon
> 
> 
> 
>> 
>> 
>> On Wed, Jun 29, 2011 at 3:34 PM, Vincent Aubanel <v.aubanel at laslab.org> wrote:
>>> Hi all,
>>> 
>>> I'm using mclapply() from the multicore package to process chunks of data in parallel, and it works great.
>>> 
>>> But when I want to collect all the processed elements of the returned list into one big data frame, it takes ages.
>>> 
>>> The elements are all data frames with identical column names, and I'm using a simple rbind() inside a loop to do that. But I guess it performs some expensive checks at each iteration, since it gets slower and slower as it goes. Writing individual files out to disk, concatenating them with the system, and reading the resulting file back from disk is actually faster...
>>> 
>>> Is there a magic argument to rbind() that I'm missing, or is there any other solution to collect the results of parallel processing efficiently?
>>> 
>>> Thanks,
>>> Vincent
>>> 
> 


