[Bioc-devel] Memory issues with BiocParallel::SnowParam()

Leonardo Collado Torres lcollado at jhu.edu
Fri Jul 10 22:12:59 CEST 2015


Hi,

I ran my example code with SerialParam(), which showed a negligible 4%
memory increase between R 3.2.x and 3.1.x. This 4% could very well
fluctuate a little and might not be significantly different from 0
if I ran the test more times.

I also added a second example using code based on my analysis script.
With SerialParam(), the memory change between the same R versions is
13%, but with SnowParam() and 10 cores it's 82%. That is
still far from the > 150% increase (2.5 fold change) I'm seeing with
the real data.
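
For reference, the comparison can be sketched roughly as below. This is
only a minimal sketch, not the code from the linked page: `chunks`,
`summarize_chunk` and `run_with` are hypothetical names, the random data
frames stand in for the real input, and 2 workers are used instead of 10
to keep it light. Memory is not measured here; in my tests it is
whatever SGE reports for the whole job.

```r
library(BiocParallel)

# Hypothetical stand-in for the real data: a list of data frames
# filled with random values.
set.seed(20150710)
chunks <- replicate(
    10,
    data.frame(x = rnorm(1e4), y = rnorm(1e4)),
    simplify = FALSE
)

# Hypothetical per-chunk task; the real script calls derfinder::filterData().
summarize_chunk <- function(df) colMeans(df)

# Run the same task under a given backend so that only BPPARAM changes.
run_with <- function(param) {
    bplapply(chunks, summarize_chunk, BPPARAM = param)
}

serial_res <- run_with(SerialParam())
snow_res <- run_with(SnowParam(workers = 2))  # my tests used 10 workers
```

Both calls should return the same values, so any difference between the
backends shows up only in the resources the job consumes.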

I initially thought that these observations ruled out everything
except SnowParam(). However, maybe the initial 13% memory increase
multiplied by 10 (well, less than linear) is what I'm seeing with 10
cores (82% increase).
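
The arithmetic behind that guess, as a small base R sketch. The
variable names are made up for illustration, and the assumption that
the serial increase simply scales with the number of workers is only a
hypothesis, not something I have verified.

```r
serial_increase <- 0.13  # memory increase with SerialParam()
snow_increase <- 0.82    # memory increase with SnowParam(), 10 cores
workers <- 10

# Increase expected if the serial increase scaled linearly with workers.
linear_estimate <- serial_increase * workers

# Fraction of the linear estimate that was actually observed.
observed_fraction <- snow_increase / linear_estimate

# Convert a relative increase into a fold change (a 150% increase is 2.5x).
fold_change <- function(increase) 1 + increase

print(linear_estimate)             # 1.3, i.e. a 130% increase
print(round(observed_fraction, 2)) # 0.63
print(fold_change(1.5))            # 2.5
```

So the 82% increase is roughly two thirds of what strictly linear
scaling of the 13% would give.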

The updated information is available at
http://lcolladotor.github.io/SnowParam-memory/



As for Vincent's suggestion of building an AMI and using EC2, I don't
have experience with them. I'm not sure I'll be able to look into them
and create a reproducible environment.


Cheers,
Leo

On Fri, Jul 10, 2015 at 7:12 AM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:
> I have had (potentially transient and environment-related) problems with
> bplapply in gQTLstats. I substituted the foreach abstractions and the code
> worked. I still have difficulty seeing how to diagnose the trouble I ran
> into.
>
> I'd suggest that you code so that you can easily substitute parallel- or
> foreach- or BatchJobs-based cluster control. This can help crudely isolate
> the source of trouble.
>
> It would be very nice to have a way of measuring resource usage in cluster
> settings, both for diagnosis and strategy selection. For jobs that succeed,
> BatchJobs records memory used in its registry database, based on gc(). I
> would hope that there are tools that could be used to help one figure out
> how to factor a task so that it is feasible given some view of environment
> constraints.
>
> It might be useful for you to build an AMI and then a cluster that allows
> replication of the condition you are seeing on EC2. This could help with
> diagnosis and might be a basis for defining better instrumentation tools
> for both diagnosis and planning.
>
> On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres <lcollado at jhu.edu>
> wrote:
>>
>> Hi,
>>
>> I have a script that at some point generates a list of DataFrame
>> objects which are rather large matrices. I then feed this list to
>> BiocParallel::bplapply() and process them.
>>
>> Previously, I noticed that in our SGE managed cluster using
>> MulticoreParam() led to 5 to 8 times higher memory usage as I posted
>> in https://support.bioconductor.org/p/62551/#62877. Martin posted in
>> https://support.bioconductor.org/p/62551/#62880 that "Probably the
>> tools used to assess memory usage are misleading you." This could be
>> true, but they are the tools that determine memory usage for all jobs
>> in the cluster. Meaning that if my memory usage blows up according to
>> these tools, my jobs get killed.
>>
>> That was with R 3.1.x and in particular running
>>
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>> with
>>
>> $ sh step1-fullCoverage.sh brainspan
>>
>> which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
>> I recently tried to reproduce this (to check changes in run time given
>> rtracklayer's improvements with BigWig files) using R 3.2.x and the
>> memory went up to 450 GB before the job got killed given the maximum
>> memory I specified for the job. The same is true using R 3.2.0.
>>
>> Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (just one
>> bug fix is different, for other code not used in this script). I know
>> that BiocParallel changed quite a bit between those versions, and in
>> particular SnowParam(). So that's why my prime suspect is
>> BiocParallel.
>>
>> I made a smaller reproducible example which you can view at
>> http://lcolladotor.github.io/SnowParam-memory/. This example uses a
>> list of data frames with random data, and also uses 10 cores. You can
>> see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
>> does use more memory than SnowParam(), as reported by SGE. Beyond the
>> actual session info differences due to changes in BiocParallel's
>> implementation, I noticed that the cluster type changed from PSOCK to
>> SOCK. I don't know whether this could explain the memory increase.
>>
>> The example doesn't generate the huge fold change between R 3.1.x and
>> the other two versions (still 1.27x > 1x) that I see with my analysis
>> script, so in that sense it's not the best example for the problem I'm
>> observing. My tests with
>>
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>> were between June 23rd and 28th, so maybe some recent changes in
>> BiocParallel addressed this issue.
>>
>>
>> I'm not sure how to proceed now. One idea is to make another example
>> with the same type of objects and operations I use in my analysis
>> script.
>>
>> A second one is to run my analysis script with SerialParam() on the
>> different R versions to check if they use different amounts of memory
>> which would suggest that the memory issue is not caused by
>> SnowParam(). For example, maybe changes in rtracklayer are the ones
>> driving the huge memory changes I'm seeing in my analysis scripts.
>>
>> However, I don't really suspect rtracklayer given the memory load
>> that I checked manually a couple of times with "qmem". I
>> believe that the memory blows up at
>>
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124
>> which in turn uses derfinder::filterData(). This function imports:
>>
>> '[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
>> Rle, DataFrame from S4Vectors
>> Reduce method from S4Vectors
>>
>>
>> https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51
>>
>>
>> Best,
>> Leo
>>
>>
>> The history of the analysis scripts doesn't reveal any other leads:
>>
>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh
>>
>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>


