[Bioc-devel] Trying to reduce the memory overhead when using mclapply
Martin Morgan
mtmorgan at fhcrc.org
Thu Nov 14 16:47:49 CET 2013
On 11/14/2013 12:13 AM, Leonardo Collado Torres wrote:
> Dear BioC developers,
>
> I am trying to understand how to use mclapply() without blowing up the
> memory usage and need some help.
>
> My use case is splitting a large IRanges::DataFrame() into chunks, and
> feeding these chunks to mclapply(). Let's say that I am using n cores and
> that the operation I am doing uses K memory units.
That the data frame can be parallelized across rows implies that it can also be
vectorized. It would be useful to confirm that your complicated function is
actually fully vectorized, because the speed gains from vectorization can be
100- to 1000-fold compared to the speed gains (and added complexity) of parallel
evaluation.
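As a rough illustration of the difference, here is a minimal sketch in base R.
The data frame `df`, its columns `x` and `y`, and the function names `rowwise`
and `vectorized` are made up for this example; they stand in for whatever your
own function does, and the timing ratio you see will depend on that function.

```r
# Hypothetical data; the column names are assumptions for illustration only.
set.seed(1)
df <- data.frame(x = runif(1e4), y = runif(1e4))

# Row-by-row version: one R-level function call per row.
rowwise <- function(df) {
    vapply(seq_len(nrow(df)), function(i) {
        log(df$x[i] + 1) * df$y[i]
    }, numeric(1))
}

# Vectorized version: a single expression operating on whole columns.
vectorized <- function(df) {
    log(df$x + 1) * df$y
}

res_loop <- rowwise(df)
res_vec <- vectorized(df)

# Both approaches should give the same answer.
stopifnot(isTRUE(all.equal(res_loop, res_vec)))

# Compare timings on your own machine.
print(system.time(rowwise(df)))
print(system.time(vectorized(df)))
```

If the two versions agree and the vectorized one is much faster, it is probably
worth vectorizing first and only then deciding whether mclapply() is still
needed.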
A simple necessary condition might be that the function scales linearly or
better with the number of rows, especially as the number of rows gets large.
Even then there may be some obvious ways of speeding up the vectorized code,
e.g., hoisting constant expressions out of for loops or lapply() calls.
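To make the hoisting and scaling points concrete, here is a small sketch in
base R. The vector `x`, the index `idx`, and the function names `unhoisted` and
`hoisted` are hypothetical and only illustrate the pattern; they are not taken
from your code.

```r
set.seed(1)
x <- runif(1e3)
idx <- seq_len(200)

# Before: mean(x) and sd(x) do not depend on i, yet they are
# recomputed on every iteration.
unhoisted <- function(x, idx) {
    vapply(idx, function(i) (x[i] - mean(x)) / sd(x), numeric(1))
}

# After: the constant expressions are computed once, outside the loop.
hoisted <- function(x, idx) {
    mu <- mean(x)
    sigma <- sd(x)
    vapply(idx, function(i) (x[i] - mu) / sigma, numeric(1))
}

# Fully vectorized form of the same calculation, for comparison.
z_vec <- (x[idx] - mean(x)) / sd(x)

z_unhoisted <- unhoisted(x, idx)
z_hoisted <- hoisted(x, idx)
stopifnot(isTRUE(all.equal(z_unhoisted, z_hoisted)),
          isTRUE(all.equal(z_hoisted, z_vec)))

# Crude scaling check: elapsed time should grow roughly linearly
# (or better) as the input gets larger.
for (n in c(1e3, 1e4, 1e5)) {
    xn <- runif(n)
    elapsed <- system.time(hoisted(xn, seq_len(n)))[["elapsed"]]
    cat(sprintf("n = %d: %.3f s\n", as.integer(n), elapsed))
}
```

Running the same scaling loop on `unhoisted` should show much worse than linear
growth, since each of the n iterations does work proportional to n.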
There are some incomplete hints in the 'Efficient R' links at
http://bioconductor.org/help/course-materials/2013/UnderstandingRBioc2013/ and
the 'working with large data' section of
http://bioconductor.org/help/course-materials/2013/Akron-Oct-2013/StatisticalComputing.pdf.
Martin
>
> I understand that the individual jobs in mclapply() cannot detect how the
> others are doing or whether they need to run gc(). While this, coupled with
> n * K, could explain a higher memory usage, I am running into higher than
> expected memory loads.
>
> I have tried
> 1) pre-splitting the data into a list (one element per chunk),
> 2) assigning the elements of the list as elements of an environment and then
> using mclapply() over a set of indexes,
> 3) saving each chunk on its own Rdata file, then using mclapply with a
> function that loads the appropriate chunk and then performs the operation
> of interest.
>
> Strategy 3 performs best in terms of max memory usage, but I am afraid that
> it is more error prone due to having to write to disk.
>
> Do you have any other ideas/tips on how to reduce the memory load? In other
> words, is there a strategy to reduce the number of copies as much as
> possible when using mclapply()?
>
>
> I have a full example (with data.frame instead of DataFrame) and code
> comparing the three options described above at http://bit.ly/1ar71yA
>
>
> Thank you,
> Leonardo
>
> Leonardo Collado Torres, PhD student
> Department of Biostatistics
> Johns Hopkins University
> Bloomberg School of Public Health
> Website: http://www.biostat.jhsph.edu/~lcollado/
> Blog: http://bit.ly/FellBit
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793