[Bioc-devel] Trying to reduce the memory overhead when using mclapply
Ryan
rct at thompsonclan.org
Thu Nov 14 10:09:37 CET 2013
To minimize the additional memory used by mclapply, remember that
mclapply works by forking processes. The advantage of forking is that
as long as an object is not modified in either the parent or the child,
both share the memory for that object. In effect, a child process
*only* uses a significant amount of memory when it modifies an existing
object (triggering creation of a copy) or creates a new object.
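As a rough illustration of that sharing (a minimal sketch, not taken from
your gist: the matrix `big` and the `cores` variable are made-up names),
the children below only read an object that was created in the parent
before forking, so it never has to be split or passed to them:

```r
library(parallel)

# Hypothetical large object, created in the parent before any fork.
big <- matrix(rnorm(1e6), ncol = 10)

# mclapply relies on forking, which is not available on Windows,
# where mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each child only reads `big`, so the matrix itself stays shared with the
# parent. The new memory per task is the extracted column and its sum.
col_sums <- mclapply(seq_len(ncol(big)), function(j) sum(big[, j]),
                     mc.cores = cores)
```

If the function modified `big` instead of only reading it, the child would
end up with its own copy of whatever it touched.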
In your case, there's no point in splitting the data (which results in
creating copies). You only have to split the indices using
parallel::splitIndices. I've tried to incorporate this into your gist:
https://gist.github.com/DarwinAwardWinner/7463652
The key line is:
    res4 <- mclapply(splitIndices(nrow(data), opt$mcores),
                     function(i) rowMeans(data[i,]),
                     mc.cores=opt$mcores)
Also, for concatenating the results, you can use "do.call(c,
unname(res4))".
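Put together as something you can run on its own, the idea looks roughly
like this. It is only a sketch: the small random data frame and the
`mcores` variable are stand-ins I made up for the gist's `data` and
`opt$mcores`, and the `drop = FALSE` is my own addition as a precaution
for one-row chunks.

```r
library(parallel)

# Hypothetical stand-ins for the gist's `data` and `opt$mcores`.
data <- as.data.frame(matrix(rnorm(1000 * 5), ncol = 5))
mcores <- if (.Platform$OS.type == "windows") 1L else 2L

# splitIndices() returns a list of integer vectors that together
# partition 1:nrow(data) into `mcores` contiguous chunks.
idx <- splitIndices(nrow(data), mcores)

# Each child subsets the shared `data` itself, so only the rows it
# works on are copied inside that child.
res <- mclapply(idx, function(i) rowMeans(data[i, , drop = FALSE]),
                mc.cores = mcores)

# Concatenate the per-chunk results into one vector, in row order.
means <- do.call(c, unname(res))
```

Because the chunks are contiguous and returned in order, `means` lines up
with the rows of `data`.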
On Thu Nov 14 00:13:41 2013, Leonardo Collado Torres wrote:
> Dear BioC developers,
>
> I am trying to understand how to use mclapply() without blowing up the
> memory usage and need some help.
>
> My use case is splitting a large IRanges::DataFrame() into chunks, and
> feeding these chunks to mclapply(). Let's say that I am using n cores and
> that the operation I am doing uses K memory units.
>
> I understand that the individual jobs in mclapply() cannot detect how the
> others are doing and if they need to run gc(). While this, coupled with n * K,
> could explain a higher memory usage, I am running into higher than
> expected memory loads.
>
> I have tried
> 1) pre-splitting the data into a list (one element per chunk),
> 2) assigning the elements of the list as elements of an environment and then
> using mclapply() over a set of indexes,
> 3) saving each chunk on its own Rdata file, then using mclapply with a
> function that loads the appropriate chunk and then performs the operation
> of interest.
>
> Strategy 3 performs best in terms of max memory usage, but I am afraid that
> it is more error prone due to having to write to disk.
>
> Do you have any other ideas/tips on how to reduce the memory load? In other
> words, is there a strategy to reduce the number of copies as much as
> possible when using mclapply()?
>
>
> I have a full example (with data.frame instead of DataFrame) and code
> comparing the three options described above at http://bit.ly/1ar71yA
>
>
> Thank you,
> Leonardo
>
> Leonardo Collado Torres, PhD student
> Department of Biostatistics
> Johns Hopkins University
> Bloomberg School of Public Health
> Website: http://www.biostat.jhsph.edu/~lcollado/
> Blog: http://bit.ly/FellBit
>