[Bioc-devel] Memory issues with BiocParallel::SnowParam()

Valerie Obenchain vobencha at fredhutch.org
Thu Jul 16 05:56:35 CEST 2015


Correction below.

On 07/15/2015 04:14 PM, Valerie Obenchain wrote:
> Hi,
>
> BiocParallel in release and devel are quite similar so I'd like to
> narrow the focus to "before the changes to SnowParam" and after. This
> means BiocParallel 1.0.3 (R 3.1) vs BiocParallel 1.3.34 (R 3.2.1) which
> is the current devel.
>
> (1) master vs worker memory
>
> I'm more concerned about memory use on the master than on the workers.
> This is where the code has changed the most and where the data are
> touched the most, i.e., split into tasks and divided among workers. So
> right now I'm focused more on the numbers in the mem_email files than
> on the logs.
>
> (2) virtual vs actual memory
>
> The mem_email files show the virtual memory requested (Max vmem) for the
> job but not the actual memory used. Can you output the actual memory
> used? That would be helpful.
>
> It does look like more memory is requested (not necessarily used) in
> BiocParallel 1.3.34 vs 1.0.3.
>
> (3) SGE and mem_free
>
> This is more of an fyi -
>
> It was my understanding that once Grid Engine offered the mem_free
> option, h_vmem was no longer necessary and was actively discouraged by
> many cluster admins.

A Grid Engine update can enable cgroups, which allow mem_free to be
enforced, i.e., jobs that exceed the specified memory can be killed. In
that case mem_free serves the greatest purpose and h_vmem isn't so
useful. I'm not sure if your cluster has cgroups enabled.

Valerie

>
> mem_free is used to track the available physical memory on the node, and
> when a job is submitted to that node SGE subtracts the requested value
> from the available memory. h_vmem sets a threshold on the virtual
> memory requested - a job is (usually) killed when the requested memory
> exceeds this amount. Because many apps request more memory than they
> use, scheduling by h_vmem makes the cluster very inefficient, as you
> are reserving chunks of memory that never get used.
>
> You may want to follow up with your cluster admin about this. Maybe
> you've already gone down this road and there are relatively small
> default memory allocations per slot so it's necessary to specify h_vmem.
>
>
> I like how you've made information related to this bug report available
> on GitHub. Having that space to record thoughts, store log/memory files
> and provide code really works well.
>
> Valerie
>
>
>
> On 07/14/2015 01:33 PM, Leonardo Collado Torres wrote:
>> Hi Valerie,
>>
>> I have re-run my two examples twice using "log = TRUE" and updated the
>> output at http://lcolladotor.github.io/SnowParam-memory/. As I was
>> writing this email (all morning...), I made a fourth run where I saved
>> the gc() information to compare against R 3.1.x. That fourth run partly
>> debunked what I was taking away from replicate runs 2 and 3.
>>
>>
>>
>> ## Between runs
>>
>>
>> Between runs there's almost no variability in the memory used with
>> R 3.1.x, a little bit with 3.2.1, and more with R 3.2.0. The variability
>> shown could be due to using BiocParallel 1.2.9 in runs 2 to 4 vs 1.2.7
>> in run 1 for R 3.2.0, and 1.3.31 vs 1.3.34 in R 3.2.1.
>>
>>
>>
>> ## gc() output differences
>>
>> Now, from the gc() output obtained with "log = TRUE", I see hardly any
>> differences in the first example between R:
>>
>> * 3.2.0
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.2.o6463210
>>
>> * 3.2.1
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.2.x.o6463213
>>
>>
>> which makes sense given that the memory use was the same:
>>
>> * 3.2.0
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.2.txt#L24
>>
>> * 3.2.1
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.2.x.txt#L24
>>
>>
>> However, I do notice a large difference versus R 3.1.x:
>>
>> * log
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.1.x.o6463536
>>
>> * mem info
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.1.x.txt#L50
>>
>>
>>
>>
>> In the derfinder example, the memory goes from 10.9 GB in R 3.2.0 to
>> 12.87 GB in R 3.2.1 (run 2) with SnowParam(). The 18% increase
>> reported there is kind of similar to the increase from comparing the
>> max used mb output from gc():
>>
>> # R 3.2.1 vs 3.2.0
>> * gc() max mem used mb ratio: (303.6 + 548.9) / (251.3 + 547.6) =~ 1.07
>> * max mem used in GB from cluster email: 12.871 / 10.904 =~ 1.18
>>
>> From this observation, maybe using "log = TRUE" and checking whether
>> the memory use reported by gc() goes down will be useful for testing
>> changes in BiocParallel to see if they increase/decrease memory use.
>> When I run the tests locally (using 2 cores), I get similar ratios;
>> data at
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/local_run.txt.
>>
>>
>>
>> However, the same type of comparison between R 3.2.1 and R 3.1.x shows
>> that the numbers can be off, although the 2.8 ratio is closer to what I
>> saw in my analysis scenario (> 2.5).
>>
>> # R 3.2.1 vs 3.1.x
>> * gc() max mem used mb ratio: (303.6 + 548.9) / (236 + 66) =~ 2.83
>> * max mem used in GB from cluster email: 12.871 / 7.175 =~ 1.79
>>
>> Numbers from:
>> # gc() info
>> * 3.2.1
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.x.o6463204#L15-L16
>>
>> * 3.2.0
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.o6463201#L15-L16
>>
>> * 3.1.x
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.1.x.o6463545#L10-L11
>>
>> # cluster mem info
>> * 3.2.1
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.2.x.txt#L24
>>
>> * 3.2.0
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.2.txt#L24
>>
>> * 3.1.x
>> https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.1.x.txt#L50
>>
>>
>>
>>
>>
>>
>>
>> ## Max mem vs gc()
>>
>>
>> Comparing the memory used in GB from the cluster email to the max used
>> mb from gc() multiplied by 10 (number of cores used), I see that the
>> ratio is somewhat consistent.
>>
>>
>> # first example, R 3.2.0, SnowParam() run 2
>> 7.285 * 1024 / (10 * (23.9 + 461.8)) =~ 1.54
>> # first example, R 3.2.1, SnowParam(), run 2
>> 7.286 * 1024 / (10 * (24 + 461.8)) =~ 1.54
>>
>> # derfinder example, R 3.2.0, SnowParam(), run 2
>> 10.904 * 1024 / (10 * (251.3 + 547.6)) =~ 1.4
>> # derfinder example, R 3.2.1, SnowParam(), run 2
>> 12.871 * 1024 / (10 * (303.6 + 548.9)) =~ 1.55
>>
>> # derfinder example, R 3.2.0, MulticoreParam(), run 2
>> 13.789 * 1024 / (10 * (230.8 + 770.2)) =~ 1.41
>> # derfinder example, R 3.2.1, MulticoreParam(), run 2
>> 13.671 * 1024 / (10 * (245.9 + 757.5)) =~ 1.4
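>>
>> As a minimal sketch of the ratio computed above (the helper name is
>> just illustrative; the inputs are copied by hand from the cluster email
>> and from the "max used (Mb)" values that gc() reports in the logs):
>>
>> ## cluster_gb: memory reported in the SGE email for the job, in GB
>> ## ncells_mb, vcells_mb: gc() "max used (Mb)" for Ncells and Vcells
>> ## cores: number of workers used
>> mem_ratio <- function(cluster_gb, ncells_mb, vcells_mb, cores = 10) {
>>     cluster_gb * 1024 / (cores * (ncells_mb + vcells_mb))
>> }
>> mem_ratio(12.871, 303.6, 548.9)  ## derfinder, R 3.2.1, run 2: ~1.55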
>>
>> And from this other observation, maybe I can use the gc() output from
>> "log = TRUE" to get an idea of whether the memory use reported by my
>> cluster is in line with previous runs, or if there's a cluster issue.
>> This ratio could also be used to compare different cluster environments
>> to see which ones report greater/lower memory use.
>>
>>
>> However, the above ratios are different with R 3.1.x
>>
>> # first example, R 3.1.x, SnowParam(), run 4
>> 5.036 * 1024 / (10 * (32.1 + 218.5)) =~ 2.06
>>
>> # derfinder example, R 3.1.x, SnowParam(), run 4
>> 7.175 * 1024 / (10 * (236 + 66)) =~ 2.43
>> # derfinder example, R 3.1.x, MulticoreParam(), run 4
>> 8.473 * 1024 / (10 * (240.7 + 189.1)) =~ 2.02
>>
>>
>> I'm not sure if this is a hint, but the largest difference between R
>> 3.1.x and the other two is in the max used mb from Vcells.
>>
>>
>>
>>
>> Summarizing, I was thinking that
>>
>> (A) we could use the output from gc() to compare between versions and
>> check which changes lowered the memory required,
>> and (B) estimate the actual memory needed as measured by the cluster
>> as well as compare cluster environments.
>>
>> However, the gc() numbers from R 3.1.x (only available on the 4th
>> replicate run) don't seem to support these thoughts.
>>
>>
>>
>> Or do you interpret these numbers differently?
>>
>>
>>
>> Best,
>> Leo
>>
>> On Sun, Jul 12, 2015 at 11:00 AM, Valerie Obenchain
>> <vobencha at fredhutch.org> wrote:
>>> Hi Leo,
>>>
>>> Thanks for the sample code; I'll take a look.
>>>
>>> You're right, SnowParam has changed quite a bit - logging, error
>>> handling, etc. The memory use you're seeing is a concern - thanks for
>>> reporting it.
>>>
>>> As an fyi, the log output for SnowParam and MulticoreParam now includes
>>> gc(), system.time() and other stats from the workers.
>>>
>>> SnowParam(log = TRUE)
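>>>
>>> For example, something along these lines (a sketch; "myList" and
>>> "someFunction" are placeholders for your own objects):
>>>
>>> library(BiocParallel)
>>> ## log = TRUE echoes gc(), timing and other worker stats in the output
>>> param <- SnowParam(workers = 10, log = TRUE)
>>> res <- bplapply(myList, someFunction, BPPARAM = param)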
>>>
>>>
>>> Valerie
>>>
>>>
>>>
>>>
>>> On 07/10/2015 01:12 PM, Leonardo Collado Torres wrote:
>>>>
>>>> Hi,
>>>>
>>>> I ran my example code with SerialParam(), which had a negligible 4%
>>>> memory increase between R 3.2.x and 3.1.x. This 4% could very well
>>>> fluctuate a little bit and might not be significantly different from 0
>>>> if I run the test more times.
>>>>
>>>> I also added a second example using code based on my analysis script.
>>>> With SerialParam(), the memory change is 13%, but with SnowParam()
>>>> it's 82% between the R versions already mentioned, using 10 cores.
>>>> That's still far from the > 150% increase (2.5-fold change) I'm seeing
>>>> with the real data.
>>>>
>>>> I initially thought that these observations ruled out everything else
>>>> except SnowParam(). However, maybe the initial 13% memory increase
>>>> multiplied by 10 (well, less than linear) is what I'm seeing with 10
>>>> cores (82% increase).
>>>>
>>>> The updated information is available at
>>>> http://lcolladotor.github.io/SnowParam-memory/
>>>>
>>>>
>>>>
>>>> As for Vincent's suggestion of building an AMI and using EC2, I don't
>>>> have experience with them. I'm not sure I'll be able to look into them
>>>> and create a reproducible environment.
>>>>
>>>>
>>>> Cheers,
>>>> Leo
>>>>
>>>> On Fri, Jul 10, 2015 at 7:12 AM, Vincent Carey
>>>> <stvjc at channing.harvard.edu> wrote:
>>>>>
>>>>> I have had (potentially transient and environment-related) problems
>>>>> with bplapply in gQTLstats. I substituted the foreach abstractions and
>>>>> the code worked. I still have difficulty seeing how to diagnose the
>>>>> trouble I ran into.
>>>>>
>>>>> I'd suggest that you code so that you can easily substitute parallel-
>>>>> or foreach- or BatchJobs-based cluster control. This can help crudely
>>>>> isolate the source of trouble.
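>>>>>
>>>>> A rough sketch of what I mean, using BiocParallel's backend objects
>>>>> (the function and object names here are made up):
>>>>>
>>>>> library(BiocParallel)
>>>>> ## take the backend as an argument so it can be swapped without
>>>>> ## touching the analysis code itself
>>>>> runStep <- function(x, BPPARAM = bpparam()) {
>>>>>     bplapply(x, function(el) sum(el), BPPARAM = BPPARAM)
>>>>> }
>>>>> ## then try SerialParam(), SnowParam(10), MulticoreParam(10),
>>>>> ## DoparParam() (foreach) or BatchJobsParam() to isolate the backend
>>>>> res <- runStep(myList, BPPARAM = SnowParam(workers = 10))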
>>>>>
>>>>> It would be very nice to have a way of measuring resource usage in
>>>>> cluster settings, both for diagnosis and strategy selection. For jobs
>>>>> that succeed, BatchJobs records memory used in its registry database,
>>>>> based on gc(). I would hope that there are tools that could be used to
>>>>> help one figure out how to factor a task so that it is feasible given
>>>>> some view of environment constraints.
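>>>>>
>>>>> If I remember the BatchJobs API correctly, something along these lines
>>>>> pulls those recorded figures back out of an existing registry (a sketch
>>>>> only; "reg" is assumed to exist already):
>>>>>
>>>>> library(BatchJobs)
>>>>> info <- getJobInfo(reg)   ## one row per job, from the registry database
>>>>> summary(info$memory)      ## gc()-based memory recorded for finished jobs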
>>>>>
>>>>> It might be useful for you to build an AMI and then a cluster that
>>>>> allows replication of the condition you are seeing on EC2. This could
>>>>> help with diagnosis and might be a basis for defining better
>>>>> instrumentation tools for both diagnosis and planning.
>>>>>
>>>>> On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres
>>>>> <lcollado at jhu.edu>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a script that at some point generates a list of DataFrame
>>>>>> objects which are rather large matrices. I then feed this list to
>>>>>> BiocParallel::bplapply() and process them.
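>>>>>>
>>>>>> The overall shape of the code is roughly this (a toy sketch, not the
>>>>>> real script; the data and the function applied are placeholders):
>>>>>>
>>>>>> library(S4Vectors)
>>>>>> library(BiocParallel)
>>>>>> ## a list of large DataFrame objects (one per chromosome in the real case)
>>>>>> dfList <- replicate(10,
>>>>>>     DataFrame(matrix(rnorm(1e6), ncol = 10)),
>>>>>>     simplify = FALSE)
>>>>>> res <- bplapply(dfList, function(x) colSums(as.data.frame(x)),
>>>>>>     BPPARAM = SnowParam(workers = 10))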
>>>>>>
>>>>>> Previously, I noticed that in our SGE-managed cluster using
>>>>>> MulticoreParam() led to 5 to 8 times higher memory usage, as I posted
>>>>>> in https://support.bioconductor.org/p/62551/#62877. Martin posted in
>>>>>> https://support.bioconductor.org/p/62551/#62880 that "Probably the
>>>>>> tools used to assess memory usage are misleading you." This could be
>>>>>> true, but they are the tools that determine memory usage for all jobs
>>>>>> in the cluster, meaning that if my memory usage blows up according to
>>>>>> these tools, my jobs get killed.
>>>>>>
>>>>>> That was with R 3.1.x and in particular running
>>>>>>
>>>>>>
>>>>>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>>>>>>
>>>>>> with
>>>>>>
>>>>>> $ sh step1-fullCoverage.sh brainspan
>>>>>>
>>>>>> which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
>>>>>> I recently tried to reproduce this (to check changes in run time
>>>>>> given rtracklayer's improvements with BigWig files) using R 3.2.x,
>>>>>> and the memory went up to 450 GB before the job was killed for
>>>>>> exceeding the maximum memory I had specified for the job. The same
>>>>>> is true using R 3.2.0.
>>>>>>
>>>>>> Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (the only
>>>>>> difference is one bug fix, in code not used by this script). I know
>>>>>> that BiocParallel changed quite a bit between those versions, and in
>>>>>> particular SnowParam(). That's why my prime suspect is BiocParallel.
>>>>>>
>>>>>> I made a smaller reproducible example which you can view at
>>>>>> http://lcolladotor.github.io/SnowParam-memory/. This example uses a
>>>>>> list of data frames with random data, and also uses 10 cores. You can
>>>>>> see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
>>>>>> does use more memory than SnowParam(), as reported by SGE. Beyond the
>>>>>> session info differences due to changes in BiocParallel's
>>>>>> implementation, I noticed that the cluster type changed from PSOCK to
>>>>>> SOCK. I don't know whether this could explain the memory increase.
>>>>>>
>>>>>> The example doesn't generate the huge fold change between R 3.1.x and
>>>>>> the other two versions (still a 1.27x increase) that I see with my
>>>>>> analysis script, so in that sense it's not the best example for the
>>>>>> problem I'm observing. My tests with
>>>>>>
>>>>>>
>>>>>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>>>>>>
>>>>>> were between June 23rd and 28th, so maybe some recent changes in
>>>>>> BiocParallel addressed this issue.
>>>>>>
>>>>>>
>>>>>> I'm not sure how to proceed now. One idea is to make another example
>>>>>> with the same type of objects and operations I use in my analysis
>>>>>> script.
>>>>>>
>>>>>> A second one is to run my analysis script with SerialParam() on the
>>>>>> different R versions to check if they use different amounts of memory,
>>>>>> which would suggest that the memory issue is not caused by
>>>>>> SnowParam(). For example, maybe changes in rtracklayer are the ones
>>>>>> driving the huge memory changes I'm seeing in my analysis scripts.
>>>>>>
>>>>>> However, I don't really suspect rtracklayer given the memory load
>>>>>> reported by "qmem", which I checked manually a couple of times. I
>>>>>> believe that the memory blows up at
>>>>>>
>>>>>>
>>>>>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124
>>>>>>
>>>>>> which in turn uses derfinder::filterData(). This function imports:
>>>>>>
>>>>>> '[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
>>>>>> Rle, DataFrame from S4Vectors
>>>>>> Reduce method from S4Vectors
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51
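>>>>>>
>>>>>> Conceptually, those imports suggest operations along these lines (not
>>>>>> the actual filterData() code, just an illustration of the kind of
>>>>>> Rle/DataFrame manipulation involved):
>>>>>>
>>>>>> library(S4Vectors)
>>>>>> ## toy coverage table: one Rle column per sample
>>>>>> cov <- DataFrame(s1 = Rle(c(0, 5, 0), c(100, 50, 100)),
>>>>>>                  s2 = Rle(c(1, 8, 0), c(100, 50, 100)))
>>>>>> ## keep positions where the mean coverage passes a cutoff
>>>>>> keep <- as.vector(Reduce('+', as.list(cov)) / ncol(cov) > 2)
>>>>>> filtered <- cov[keep, ]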
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Leo
>>>>>>
>>>>>>
>>>>>> The history of the analysis scripts doesn't reveal any other leads:
>>>>>>
>>>>>>
>>>>>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, Seattle, WA 98109
>>>
>>> Email: vobencha at fredhutch.org
>>> Phone: (206) 667-3158
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fredhutch.org
Phone: (206) 667-3158


