[Bioc-devel] Memory issues with BiocParallel::SnowParam()

Valerie Obenchain vobencha at fredhutch.org
Sun Jul 12 17:00:27 CEST 2015


Hi Leo,

Thanks for the sample code; I'll take a look.

You're right, SnowParam has changed quite a bit - logging, error 
handling, etc. The memory use you're seeing is a concern - thanks for 
reporting it.

As an FYI, the log output for SnowParam and MulticoreParam now includes 
gc(), system.time() and other stats from the workers:

SnowParam(log = TRUE)
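
For example, a minimal sketch (the worker count and the toy function
are just placeholders):

    library(BiocParallel)

    ## per-task gc() and system.time() stats are written to the log
    param <- SnowParam(workers = 2, log = TRUE)
    res <- bplapply(1:4, function(i) sum(rnorm(1e6)), BPPARAM = param)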


Valerie



On 07/10/2015 01:12 PM, Leonardo Collado Torres wrote:
> Hi,
>
> I ran my example code with SerialParam(), which had a negligible 4%
> memory increase between R 3.2.x and 3.1.x. This 4% could very well
> fluctuate a little bit and might not be significantly different from 0
> if I run the test more times.
>
> I also added a second example using code based on my analysis script.
> With SerialParam(), the memory change is 13%, but with SnowParam()
> it's 82% between the R versions mentioned above, using 10 cores. That's
> still far from the > 150% increase (2.5-fold change) I'm seeing with
> the real data.
>
> I initially thought that these observations ruled out everything else
> except SnowParam(). However, maybe the initial 13% memory increase
> multiplied by 10 (well, less than linear) is what I'm seeing with 10
> cores (82% increase).
>
> The updated information is available at
> http://lcolladotor.github.io/SnowParam-memory/
>
>
>
> As for what Vincent suggested about an AMI and EC2, I don't have
> experience with them. I'm not sure I'll be able to look into them and
> create a reproducible environment.
>
>
> Cheers,
> Leo
>
> On Fri, Jul 10, 2015 at 7:12 AM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
>> I have had (potentially transient and environment-related) problems
>> with bplapply in gQTLstats. I substituted the foreach abstractions and
>> the code worked. I still have difficulty seeing how to diagnose the
>> trouble I ran into.
>>
>> I'd suggest that you code so that you can easily substitute parallel-,
>> foreach-, or BatchJobs-based cluster control. This can help crudely
>> isolate the source of trouble.
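>>
>> A minimal sketch of that kind of abstraction (runChunks() is just a
>> made-up wrapper name; the point is that the backend is an argument):
>>
>>     library(BiocParallel)
>>
>>     ## keep cluster control out of the analysis code itself
>>     runChunks <- function(chunks, FUN, BPPARAM = SerialParam()) {
>>         bplapply(chunks, FUN, BPPARAM = BPPARAM)
>>     }
>>
>>     ## same call, different cluster control:
>>     ## runChunks(x, FUN, BPPARAM = SnowParam(workers = 10))
>>     ## runChunks(x, FUN, BPPARAM = MulticoreParam(workers = 10))
>>     ## (or re-implement runChunks() on top of foreach or BatchJobs)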
>>
>> It would be very nice to have a way of measuring resource usage in
>> cluster settings, both for diagnosis and strategy selection. For jobs
>> that succeed, BatchJobs records memory used in its registry database,
>> based on gc(). I would hope that there are tools that could be used to
>> help one figure out how to factor a task so that it is feasible given
>> some view of environment constraints.
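>>
>> As a crude sketch of that idea (not how BatchJobs does it internally;
>> memWrap() is a made-up helper), each worker can report its own peak
>> memory:
>>
>>     library(BiocParallel)
>>
>>     ## column 6 of gc() is the Mb value of "max used" for Ncells/Vcells
>>     memWrap <- function(FUN) {
>>         force(FUN)
>>         function(x, ...) {
>>             gc(reset = TRUE)
>>             res <- FUN(x, ...)
>>             list(result = res, max_used_mb = sum(gc()[, 6]))
>>         }
>>     }
>>
>>     ## e.g. bplapply(chunks, memWrap(FUN), BPPARAM = SnowParam(10))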
>>
>> It might be useful for you to build an AMI and then a cluster that
>> allows replication of the condition you are seeing on EC2. This could
>> help with diagnosis and might be a basis for defining better
>> instrumentation tools for both diagnosis and planning.
>>
>> On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres <lcollado at jhu.edu>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have a script that at some point generates a list of DataFrame
>>> objects, which are essentially rather large matrices. I then feed this
>>> list to BiocParallel::bplapply() and process them.
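>>>
>>> Schematically, something along these lines (the sizes, the number of
>>> list elements, and the summary step are just stand-ins for what the
>>> real script does):
>>>
>>>     library(BiocParallel)
>>>     library(S4Vectors)
>>>
>>>     ## list of largish DataFrames standing in for the real coverage data
>>>     dfl <- lapply(1:10, function(i)
>>>         DataFrame(matrix(rnorm(1e6), ncol = 10)))
>>>
>>>     res <- bplapply(dfl, function(d) {
>>>         library(S4Vectors)   ## make sure workers see DataFrame methods
>>>         colSums(as.data.frame(d))
>>>     }, BPPARAM = SnowParam(workers = 10))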
>>>
>>> Previously, I noticed that in our SGE-managed cluster, using
>>> MulticoreParam() led to 5 to 8 times higher memory usage, as I posted
>>> in https://support.bioconductor.org/p/62551/#62877. Martin posted in
>>> https://support.bioconductor.org/p/62551/#62880 that "Probably the
>>> tools used to assess memory usage are misleading you." This could be
>>> true, but they are the tools that determine memory usage for all jobs
>>> in the cluster, meaning that if my memory usage blows up according to
>>> these tools, my jobs get killed.
>>>
>>> That was with R 3.1.x, and in particular running
>>>
>>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>>> with
>>>
>>> $ sh step1-fullCoverage.sh brainspan
>>>
>>> which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
>>> I recently tried to reproduce this (to check changes in run time given
>>> rtracklayer's improvements with BigWig files) using R 3.2.x, and the
>>> memory went up to 450 GB before the job got killed for exceeding the
>>> maximum memory I had specified for it. The same is true using R 3.2.0.
>>>
>>> Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (the only
>>> difference is one bug fix, in code not used by this script). I know
>>> that BiocParallel changed quite a bit between those versions, and in
>>> particular SnowParam(). That's why my prime suspect is BiocParallel.
>>>
>>> I made a smaller reproducible example which you can view at
>>> http://lcolladotor.github.io/SnowParam-memory/. This example uses a
>>> list of data frames with random data, and also uses 10 cores. You can
>>> see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
>>> does use more memory than SnowParam(), as reported by SGE. Beyond the
>>> session info differences due to changes in BiocParallel's
>>> implementation, I noticed that the cluster type changed from PSOCK to
>>> SOCK. I don't know whether this could explain the memory increase.
>>>
>>> The example doesn't generate the huge fold change between R 3.1.x and
>>> the other two versions (still 1.27x, i.e. > 1x) that I see with my
>>> analysis script, so in that sense it's not the best example for the
>>> problem I'm observing. My tests with
>>>
>>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>>> were run between June 23rd and 28th, so maybe some recent changes in
>>> BiocParallel have addressed this issue.
>>>
>>>
>>> I'm not sure how to proceed now. One idea is to make another example
>>> with the same type of objects and operations I use in my analysis
>>> script.
>>>
>>> A second one is to run my analysis script with SerialParam() on the
>>> different R versions to check whether they use different amounts of
>>> memory, which would suggest that the memory issue is not caused by
>>> SnowParam(). For example, maybe changes in rtracklayer are what's
>>> driving the huge memory changes I'm seeing in my analysis scripts.
>>>
>>> However, I don't really suspect rtracklayer given the memory load
>>> reported by "qmem", which I checked manually a couple of times. I
>>> believe that the memory blows up at
>>>
>>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124
>>> which in turn uses derfinder::filterData(). This function imports:
>>>
>>> '[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
>>> Rle, DataFrame from S4Vectors
>>> Reduce method from S4Vectors
>>>
>>>
>>> https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51
>>>
>>>
>>> Best,
>>> Leo
>>>
>>>
>>> The history of the analysis scripts doesn't reveal any other leads:
>>>
>>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh
>>>
>>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fredhutch.org
Phone: (206) 667-3158


