[R-sig-hpc] Unreproducible crashes of R instances on a cluster running Torque

Rainer M. Krug Rainer at krugs.de
Mon May 13 13:20:39 CEST 2013


Sean Davis <sdavis2 at mail.nih.gov> writes:

> On Mon, May 13, 2013 at 4:08 AM, Rainer M. Krug <Rainer at krugs.de> wrote:
>> Simon Urbanek <simon.urbanek at r-project.org> writes:
>>
>>> On May 2, 2013, at 9:46 AM, Till Francke wrote:
>>>
>>>> Dear Sean,
>>>> thanks for your suggestions in spite of my obscure descriptions. I'll try to clarify some points:
>>>>
>>>>> R messages about "memory allocation problems"
>>>>> usually mean that your code is asking for more memory than is
>>>>> available on the machine.
>>>> I get things like
>>>>      Error: cannot allocate vector of size 304.6 Mb
>>>> However, the jobs are started with the Torque option
>>>>      #PBS -l mem=3gb
>>>> When I submit this job alone, everything works like a charm, so 3 GB seems to suffice, right?
>>>
>>> No, the 300 MB are *in addition* to all the other memory already
>>> allocated by R - probably very close to the 3 GB at that point. Also
>>> note that mem is the total memory for the whole job, not per process,
>>> so some processes may get very little (I don't use Torque, though, so
>>> this is just based on the docs).
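
As an aside, if I read the Torque docs correctly, one can also request a
per-process limit with pmem in addition to the per-job mem - a minimal
sketch of a submit script (all resource values and the script name are
placeholders):

     #PBS -l mem=3gb          # total memory for the whole job
     #PBS -l pmem=1gb         # memory limit per process
     #PBS -l nodes=1:ppn=3    # 3 processes on one node
     R CMD BATCH my_script.R  # my_script.R is a placeholder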
>>
>> If I remember correctly, memory fragmentation plays an important role
>> for R (still in version 3.0.0?): a single contiguous memory block needs
>> to be available for an allocation, so one can get these error messages
>> even if enough memory is free in total but it is fragmented into smaller
>> blocks (or does Torque take care of memory fragmentation?)
>
> Torque is a batch system.  The underlying OS (typically Linux) is
> responsible for memory management.

True - makes sense.

>
>>>
>>>
>>>> With 20 or more jobs, I get the memory message. I assumed Torque
>>>> would only start a job if the resources are available - is that a
>>>> misconception?
>>>>
>>>>
>>>>> By "crashing a node of the cluster", I
>>>>> suspect you mean that the machine becomes unreachable; this is often
>>>>> due to the machine swapping large blocks of memory (again, a memory
>>>>> issue in user code).
>>>> I cannot tell more precisely; the admin just told me he had to
>>>> reboot this node. Before that, the entire queue-handling of Torque
>>>> seemed to have come to a halt.
>>>>
>>>>> The scripts will run fine when enough memory is
>>>>> available.  So, to deal with your problem, monitor memory usage on
>>>>> running jobs and follow good programming policies regarding memory
>>>>> usage.
>>>> If that means being frugal (removing unused objects, preallocating
>>>> matrices), I have tried my best. Adding some calls to gc() seemed
>>>> to improve the situation only slightly.
>>>>
>>>
>>> R does gc automatically when it's running out of memory, so that makes
>>> no real difference. Sometimes it's useful to code in local scope so
>>> objects can be collected automatically, but that's all very
>>> application-specific.
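
The local-scope trick could look like this - a minimal sketch (the input
file name is a placeholder):

     ## only 'res' survives; the large intermediate 'tmp' goes out of
     ## scope when local() returns and can then be garbage collected
     res <- local({
         tmp <- read.table("big_input.txt")
         colSums(tmp^2)
     })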
>>>
>>>
>>>>> Request larger memory resources if that is an option.  It is
>>>>> possible that R has a memory leak, but it is rather unlikely this is
>>>>> the problem. If you still have issues, you may want to provide some error messages
>>>>> and some sessionInfo() as well as some measure of memory usage.
>>>>
>>>> For memory issues, the message above is thrown. For other jobs, the
>>>> process just terminates without any further output, right after
>>>> having read some large input files.
>>>> I agree that this is unlikely to be an R memory leak; however, I am
>>>> trying to find out what I can still do from my side, or whether I can
>>>> point the admin at some Torque configuration problems, which is what
>>>> I suspect.
>>>> Has anyone observed similar behaviour and found a fix?
>>>>
>>>
>>> It's very easy to run out of memory with parallel jobs. In particular,
>>> if you don't share data across the jobs, you'll end up using a lot of
>>> memory. People underestimate that aspect even though the math is
>>> simple: say you have 128 GB of RAM, which sounds like a lot, but if
>>> you run 40 jobs you end up with only ~3 GB per job, which is likely
>>> not enough (at least not for the jobs I'm running ;)). Note that things
>>> like parsing an input file can use quite a bit of memory - it's
>>> usually a good idea to run a pre-processing step that parses the raw
>>> files into binary objects or RData files, which can be loaded much
>>> more efficiently.
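
The pre-processing step Simon suggests is easy to script - a minimal
sketch using save()/load() (file names are placeholders):

     ## run once, outside the parallel jobs: parse the text input ...
     dat <- read.table("big_input.txt", header = TRUE)
     save(dat, file = "big_input.RData")

     ## ... then each job loads the binary version, which is much faster
     load("big_input.RData")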
>>
>> Thanks for this discussion - these are exactly the symptoms I
>> experienced and could not make sense of (i.e. crashing R sessions on the
>> cluster, hanging nodes which needed to be rebooted to work again), as
>> I assumed that Torque would protect the nodes from crashing due to
>> excessive memory usage.
>
> Some clusters do have something in place to try to do this, but it is
> not a simple task to implement well since Torque is not really
> "responsible" for memory management once a job is running.
>
>> One point is mentioned here again and again: monitor memory usage. But
>> is there an easy way to do this? Can I submit a script to Torque and get
>> back a memory report in a log file, which I can analyse to get memory
>> usage over time?
>
> You will probably need to talk to your cluster admins, but on our
> cluster, I simply log in to a node and run "top".  Other clusters have
> dedicated monitoring tools.  Finally, some clusters have a job
> postscript configured that reports on job resource usage.  All of these
> issues are best dealt with by talking to the cluster administrators,
> since each cluster (even those running Torque) is unique in some
> ways.

Yes - there is always the system-level approach. I was thinking more
along the lines of an R-level approach - something like R's memory
profiling (which I haven't used yet), or simply the peak-usage
statistics that gc() reports (see the sketch below).
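
A minimal sketch of the gc() idea (run_simulation() is just a
placeholder for the actual workload):

     gc(reset = TRUE)            # reset the "max used" statistics
     result <- run_simulation()  # placeholder for the real job
     print(gc())                 # the "max used" column now shows the
                                 # peak memory footprint of this run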

The advantage would be that one could (depending on the simulation) run
it once locally and get the memory requirements.
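
On the Torque side, if I read the docs correctly, one can also poll a
running job for its current consumption (the job id is a placeholder):

     qstat -f 12345.server | grep resources_used
     # resources_used.mem, resources_used.vmem and
     # resources_used.walltime show what the job has used so far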

Rainer

>
> Sean
>
>> Rainer
>>
>>
>>>
>>> Anyway, first run just one job and watch its memory usage to see how
>>> it behaves. A process on Linux typically doesn't return much memory
>>> to the OS, so by the time it's done you should see roughly its peak
>>> physical memory footprint.
>>>
>>>
>>>> Thanks in advance,
>>>> Till
>>>>
>>>>
>>>> R version 2.12.1 (2010-12-16)
>>>
>>> Geeez... I didn't know such ancient versions still existed in the wild =)
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] graphics  grDevices datasets  stats     utils     methods   base
>>>>
>>>> other attached packages:
>>>> [1] Rmpi_0.5-9
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Created with Opera's revolutionary e-mail module: http://www.opera.com/mail/

-- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

Tel :       +33 - (0)9 53 10 27 44
Cell:       +33 - (0)6 85 62 59 98
Fax :       +33 - (0)9 58 10 27 44

Fax (D):    +49 - (0)3 21 21 25 22 44

email:      Rainer at krugs.de

Skype:      RMkrug


