[R-SIG-Mac] multicore package: collecting results
Vincent Aubanel
v.aubanel at laslab.org
Thu Jun 30 17:19:06 CEST 2011
On Jun 30, 2011, at 3:36 PM, Simon Urbanek wrote:
>
> On Jun 30, 2011, at 7:28 AM, Vincent Aubanel wrote:
>
>> Thanks for this, it's now blazing fast, as one might reasonably expect.
>> Simon's solution is astonishingly fast; however, I had to reconstruct the factors and their levels, which were (as expected) lost in the c() operation.
>
>
> One way to avoid it is to use as.character() on factors inside the parallel function, so the pieces don't have factors. You can create the factor at the end, and it should be faster, because factor() calls as.character() anyway, and that call is a no-op once the data is already character.
It is faster, thanks! Slightly faster in the parallel loop (because the unnecessary as.character() operations are gone), and the total time for converting back to factors is down to about 3 s. I had thought that keeping data as factors was somewhat more economical and faster than keeping it as character...
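For illustration, the whole pattern looks roughly like this (a minimal sketch with toy data; the real worker, session list and core count are of course more involved):

library(multicore)  # superseded by the 'parallel' package in later R versions

## toy worker: builds character columns, never factors
mcfun <- function(chunk) {
  data.frame(value = rnorm(length(chunk)),
             cond  = sample(c("A", "B"), length(chunk), replace = TRUE),
             stringsAsFactors = FALSE)
}

lsessions <- split(1:2e6, rep(1:8, each = 250000))
dl <- mclapply(lsessions, mcfun, mc.cores = 2)

all <- do.call(rbind, dl)      # or the faster list-based concatenation
all$cond <- factor(all$cond)   # single factor() call at the very end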
Vincent
>
> Cheers,
> S
>
>
>> Unfortunately this eats up a fair amount of CPU, but on a 14-column, ~2-million-row data frame it is still 2x faster than the elegant one-line solution.
>>
>> Some figures of performance:
>>
>>> t <- proc.time()
>>> dl <- mclapply(lsessions, mcfun, mc.cores=cores)
>>> print(proc.time()-t)
>> user  system elapsed
>> 171.894 47.696 28.713
>>
>>> l <- dl
>>> all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
>>> names(all) = names(l[[1]])
>>> #attr(all, "row.names") = seq.int(all[[1]])
>>> attr(all, "row.names") = c(NA, -length(all[[1]]))
>>> class(all) = "data.frame"
>> user  system elapsed
>> 0.412 0.280 0.708
>>
>>> all$factor <- factor(all$factor); levels(all$factor) <- c("A","B")
>> ...
>> user  system elapsed
>> 4.852 2.349 7.038
>>
>>> my_df = do.call(rbind, dl)
>> user  system elapsed
>> 9.791 5.411 15.039
>>
>> Thanks to both of you!
>>
>> Vincent
>>
>>
On Jun 29, 2011, at 9:48 PM, Simon Urbanek wrote:
>>
>>>
>>> On Jun 29, 2011, at 2:59 PM, Mike Lawrence wrote:
>>>
>>>> Is the slowdown happening while mclapply runs or while you're doing
>>>> the rbind? If the latter, I wonder if the code below is more efficient
>>>> than using rbind inside a loop:
>>>>
>>>> my_df = do.call( rbind , my_list_from_mclapply )
>>>>
>>>
>>> Another potential issue is that data frames do many sanity checks, due to row.names handling etc. If you don't use row.names *and* know in advance that the concatenation is benign *and* your data types are compatible, you can usually speed things up immensely by operating on lists instead and converting to a data frame at the very end, by declaring that the resulting list conforms to the data.frame class. Again, this only works if you really know what you're doing, but the speed-up can be very big (usually orders of magnitude). This is general advice, not specific to rbind. Whether it would work for you is easy to test - something like
>>>
>>> l = my_list_from_mclapply
>>> all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
>>> names(all) = names(l[[1]])
>>> attr(all, "row.names") = c(NA, -length(all[[1]]))
>>> class(all) = "data.frame"
>>>
>>> Again, make sure all the assumptions above are satisfied before using.
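For convenience, the recipe could be wrapped in a small helper; a sketch under the same assumptions (no meaningful row names, identical columns in the same order in every piece, compatible types):

fast_rbind <- function(l) {
  ## concatenate the i-th column across all pieces with a single c() each
  out <- lapply(seq_along(l[[1L]]), function(i)
    do.call(c, lapply(l, function(x) x[[i]])))
  names(out) <- names(l[[1L]])
  ## compact internal form meaning "automatic row names, n rows"
  attr(out, "row.names") <- c(NA_integer_, -length(out[[1L]]))
  class(out) <- "data.frame"
  out
}

my_df <- fast_rbind(my_list_from_mclapply)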
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>>
>>>>
>>>>
>>>> On Wed, Jun 29, 2011 at 3:34 PM, Vincent Aubanel <v.aubanel at laslab.org> wrote:
>>>>> Hi all,
>>>>>
>>>>> I'm using mclapply() from the multicore package to process chunks of data in parallel, and it works great.
>>>>>
>>>>> But when I want to collect all the processed elements of the returned list into one big data frame, it takes ages.
>>>>>
>>>>> The elements are all data frames with identical column names, and I'm using a simple rbind() inside a loop to combine them. But I guess it performs some expensive checks at each iteration, since it gets slower and slower as it goes. Writing individual files to disk, concatenating them with the system, and reading the resulting file back in is actually faster...
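For reference, the slow pattern presumably looks like the sketch below (dl standing for the list returned by mclapply); each rbind() copies every row accumulated so far, so the total cost grows quadratically with the number of chunks:

## quadratic: iteration k re-copies all rows collected in iterations 1..k-1
big <- dl[[1]]
for (d in dl[-1]) big <- rbind(big, d)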
>>>>>
>>>>> Is there a magic argument to rbind() that I'm missing, or is there any other solution to collect the results of parallel processing efficiently?
>>>>>
>>>>> Thanks,
>>>>> Vincent
>>>>>
>>>>
>>>>
>>>
>>
>>
>