[Bioc-devel] Trying to reduce the memory overhead when using mclapply

Ryan rct at thompsonclan.org
Thu Nov 14 10:09:37 CET 2013


To minimize the additional memory used by mclapply, remember that 
mclapply works by forking processes. The advantage of forking is that 
as long as an object is not modified in either the parent or a child, 
they share the memory for that object. Effectively, a child process 
*only* uses a significant amount of memory when it modifies an 
existing object (triggering a copy-on-write) or creates a new object.

In your case, there's no point in splitting the data itself (doing so 
creates copies). You only need to split the indices, using 
parallel::splitIndices. I've tried to incorporate this into your gist: 
https://gist.github.com/DarwinAwardWinner/7463652
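
To illustrate what splitIndices does (a small hypothetical example, 
not taken from the gist):

```r
library(parallel)

## splitIndices(n, k) partitions the indices 1:n into k roughly
## equal, ordered chunks, returned as a list of integer vectors
chunks <- splitIndices(10, 3)
length(chunks)   # 3 chunks
unlist(chunks)   # the indices 1:10, in their original order
```

Because only these small index vectors are passed to the workers, the 
large object itself never needs to be copied just to split the work.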

The key line is:

	res4 <- mclapply(splitIndices(nrow(data), opt$mcores),
	                 function(i) rowMeans(data[i, ]),
	                 mc.cores = opt$mcores)

Also, for concatenating the results, you can use "do.call(c, 
unname(res4))".
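
Putting the two pieces together, here is a minimal self-contained 
sketch of the pattern, with a toy matrix standing in for the real 
data and a hard-coded core count standing in for opt$mcores:

```r
library(parallel)

dat <- matrix(rnorm(1e4), nrow = 1000)  # toy stand-in for the real data
## mc.cores > 1 is not supported on Windows, so fall back to 1 there
ncores <- if (.Platform$OS.type == "windows") 1 else 2

## Each child receives only a short vector of row indices; the large
## object `dat` is shared with the children via the fork, not copied
res <- mclapply(splitIndices(nrow(dat), ncores),
                function(i) rowMeans(dat[i, , drop = FALSE]),
                mc.cores = ncores)

## splitIndices preserves order, so concatenating the unnamed chunks
## reassembles the full per-row result
means <- do.call(c, unname(res))
stopifnot(isTRUE(all.equal(means, rowMeans(dat))))
```

The unname() matters: if the list elements carry names, c() would 
prepend them to the names of the concatenated vector.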

On Thu Nov 14 00:13:41 2013, Leonardo Collado Torres wrote:
> Dear BioC developers,
>
> I am trying to understand how to use mclapply() without blowing up the
> memory usage and need some help.
>
> My use case is splitting a large IRanges::DataFrame() into chunks, and
> feeding these chunks to mclapply(). Let's say that I am using n cores and
> that the operation I am doing uses K memory units.
>
> I understand that the individual jobs in mclapply() cannot detect how the
> others are doing and whether they need to run gc(). While this, coupled
> with the n * K usage, could explain some extra memory use, I am running
> into higher-than-expected memory loads.
>
> I have tried
> 1) pre-splitting the data into a list (one element per chunk),
> 2) assigning the elements of the list as elements of an environment and
> then using mclapply() over a set of indexes,
> 3) saving each chunk on its own Rdata file, then using mclapply with a
> function that loads the appropriate chunk and then performs the operation
> of interest.
>
> Strategy 3 performs best in terms of max memory usage, but I am afraid that
> it is more error prone due to having to write to disk.
>
> Do you have any other ideas/tips on how to reduce the memory load? In other
> words, is there a strategy to reduce the number of copies as much as
> possible when using mclapply()?
>
>
> I have a full example (with data.frame instead of DataFrame) and code
> comparing the three options described above at http://bit.ly/1ar71yA
>
>
> Thank you,
> Leonardo
>
> Leonardo Collado Torres, PhD student
> Department of Biostatistics
> Johns Hopkins University
> Bloomberg School of Public Health
> Website: http://www.biostat.jhsph.edu/~lcollado/
> Blog: http://bit.ly/FellBit
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
