[Bioc-devel] Memory issues with BiocParallel::SnowParam()

Valerie Obenchain vobencha at fredhutch.org
Sun Jul 12 17:24:51 CEST 2015


On 07/10/2015 04:12 AM, Vincent Carey wrote:
> I have had (potentially transient and environment-related) problems with
> bplapply
> in gQTLstats.

Did the problem occur during build or check, where a man page example or 
unit test could be isolated as the cause?

> I substituted the foreach abstractions and the code
> worked.  I still
> have difficulty seeing how to diagnose the trouble I ran into.
> I'd suggest that you code so that you can easily substitute parallel- or
> foreach- or
> BatchJobs-based cluster control.  This can help crudely isolate the source
> of trouble.
> It would be very nice to have a way of measuring resource usage in cluster
> settings,
> both for diagnosis and strategy selection.

SnowParam and MulticoreParam log output includes gc(), system.time() and 
all messages sent to stdout and stderr. Turn logging on with,

SnowParam(log = TRUE)

If files are more convenient, logs are written to files (one per task) 
with 'logdir',

SnowParam(log = TRUE, logdir = tempfile())
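
As a minimal sketch of what a logged run could look like (assuming 
BiocParallel is installed; the directory name "bp-logs" and the toy 
function are made up for illustration, and the directory is created 
up front so that it exists before the workers start):

```r
library(BiocParallel)

## Hypothetical log directory, used only for this illustration
logdir <- file.path(tempdir(), "bp-logs")
dir.create(logdir, showWarnings = FALSE)

## Two workers, logging enabled, logs written to files in 'logdir'
param <- SnowParam(workers = 2, log = TRUE, logdir = logdir)

res <- bplapply(1:4, function(i) {
    message("processing element ", i)   ## sent to stderr, so it is logged
    sqrt(i)
}, BPPARAM = param)

## Inspect whatever log files were written
list.files(logdir)
```

The gc() and system.time() entries in each log should give a rough 
picture of per-task memory use and run time.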

> For jobs that succeed,
> BatchJobs records
> memory used in its registry database, based on gc().  I would hope that
> there are
> tools that could be used to help one figure out how to factor a task so
> that it is feasible
> given some view of environment constraints.

Once you have an idea of memory use from the log output you can modify 
how 'X' is divided over the workers with the 'tasks' arg.

A job is defined as the 'X' in bplapply(). A task is the element(s) of 
'X' sent to a worker, e.g.,

bplapply(X = 1:5, sqrt)

SnowParam()                ## X is divided ~ evenly over max workers
SnowParam(workers = 3)     ## X divided ~ evenly over 3 workers
SnowParam(tasks = 5)       ## X divided into 5 tasks
SnowParam(workers = 2, tasks = 3) ## X divided into 3 tasks, run on 2 workers
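
To make the last case concrete, here is a small sketch (assuming 
BiocParallel is installed; 'X' and the use of sqrt are arbitrary 
stand-ins for a real workload). More tasks than workers means each 
worker holds a smaller piece of 'X' at a time, which is one way to 
trade run time for lower peak memory:

```r
library(BiocParallel)

X <- 1:6

## Two workers; X is split into three tasks rather than two
param <- SnowParam(workers = 2, tasks = 3)

bpnworkers(param)   ## number of workers requested
bptasks(param)      ## number of tasks X will be divided into

res <- bplapply(X, sqrt, BPPARAM = param)
```

The result is the same list regardless of how 'X' is divided; only 
the size of the pieces sent to each worker changes.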

If you have problems with BiocParallel, no matter how transient or 
difficult to reproduce, please let me know.


> It might be useful for you to build an AMI and then a cluster that allows
> replication of
> the condition you are seeing on EC2.  This could help with diagnosis and
> might be
> a basis for defining better instrumentation tools for both diagnosis and
> planning.
> On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres <lcollado at jhu.edu>
> wrote:
>> Hi,
>> I have a script that at some point generates a list of DataFrame
>> objects which are rather large matrices. I then feed this list to
>> BiocParallel::bplapply() and process them.
>> Previously, I noticed that in our SGE managed cluster using
>> MulticoreParam() led to 5 to 8 times higher memory usage as I posted
>> in https://support.bioconductor.org/p/62551/#62877. Martin posted in
>> https://support.bioconductor.org/p/62551/#62880 that "Probably the
>> tools used to assess memory usage are misleading you." This could be
>> true, but they are the tools that determine memory usage for all jobs
>> in the cluster. Meaning that if my memory usage blows up according to
>> these tools, my jobs get killed.
>> That was with R 3.1.x and in particular running
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>> with
>> $ sh step1-fullCoverage.sh brainspan
>> which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
>> I recently tried to reproduce this (to check changes in run time given
>> rtracklayer's improvements with BigWig files) using R 3.2.x and the
>> memory went up to 450 GB before the job got killed given the maximum
>> memory I specified for the job. The same is true using R 3.2.0.
>> Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (just one
>> bug fix is different, for other code not used in this script). I know
>> that BiocParallel changed quite a bit between those versions, and in
>> particular SnowParam(). So that's why my prime suspect is
>> BiocParallel.
>> I made a smaller reproducible example which you can view at
>> http://lcolladotor.github.io/SnowParam-memory/. This example uses a
>> list of data frames with random data, and also uses 10 cores. You can
>> see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
>> does use more memory than SnowParam(), as reported by SGE. Beyond the
>> actual session info differences due to changes in BiocParallel's
>> implementation, I noticed that the cluster type changed from PSOCK to
>> SOCK. I don't know whether this could explain the memory increase.
>> The example doesn't generate the huge fold change between R 3.1.x and
>> the other two versions (still 1.27x > 1x) that I see with my analysis
>> script, so in that sense it's not the best example for the problem I'm
>> observing. My tests with
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>> were between June 23rd and 28th, so maybe some recent changes in
>> BiocParallel addressed this issue.
>> I'm not sure how to proceed now. One idea is to make another example
>> with the same type of objects and operations I use in my analysis
>> script.
>> A second one is to run my analysis script with SerialParam() on the
>> different R versions to check if they use different amounts of memory
>> which would suggest that the memory issue is not caused by
>> SnowParam(). For example, maybe changes in rtracklayer are the ones
>> driving the huge memory changes I'm seeing in my analysis scripts.
>> However, I don't really suspect rtracklayer given the memory load
>> reported that I checked manually a couple of times with "qmem". I
>> believe that the memory blows up at
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124
>> which in turn uses derfinder::filterData(). This function imports:
>> '[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
>> Rle, DataFrame from S4Vectors
>> Reduce method from S4Vectors
>> https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51
>> Best,
>> Leo
>> History of analysis scripts doesn't reveal any other leads
>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh
>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fredhutch.org
Phone: (206) 667-3158
