[R-sig-hpc] Unreproducible crashes of R instances on a cluster running Torque

Sean Davis sdavis2 at mail.nih.gov
Mon May 13 12:31:15 CEST 2013


On Mon, May 13, 2013 at 4:08 AM, Rainer M. Krug <Rainer at krugs.de> wrote:
> Simon Urbanek <simon.urbanek at r-project.org> writes:
>
>> On May 2, 2013, at 9:46 AM, Till Francke wrote:
>>
>>> Dear Sean,
>>> thanks for your suggestions in spite of my obscure descriptions. I'll try to clarify some points:
>>>
>>>> R messages about "memory allocation problems"
>>>> usually mean that your code is asking for more memory than is
>>>> available on the machine.
>>> I get things like
>>>      Error: cannot allocate vector of size 304.6 Mb
>>> However, the jobs are started with the Torque option
>>>      #PBS -l mem=3gb
>>> When I submit this job alone, everything works like a charm, so 3 GB seems to suffice, right?
>>
>> No, the 300 MB are *in addition* to all other memory already allocated by
>> R - which is probably very close to the 3 GB. Also note that mem is the
>> total memory over all processes, not per process, so some processes may
>> get very little (I don't use Torque, though, so this is just based on the
>> docs).
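A rough, Torque-independent way to see how close a single job actually gets
to that limit is to ask R itself: gc(reset = TRUE) resets the "max used"
counters, and a final gc() then reports the peak held in R objects since the
reset (memory used outside R's heap is not counted). A minimal sketch, where
"big_input.txt" is just a placeholder for the real input:

    invisible(gc(reset = TRUE))       # reset the "max used" statistics

    ## ... the real workload goes here, e.g. ...
    x <- read.table("big_input.txt")  # placeholder input file

    peak <- gc()                      # matrix with "used" and "max used" columns
    print(peak)                       # the Mb next to "max used" is the peak
                                      # size of R's own objects (Ncells/Vcells)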
>
> If I remember correctly, memory fragmentation plays an important role
> for R (still in version 3.0.0?), in that one contiguous memory block
> needs to be available for a vector - otherwise one can get these error
> messages even if enough memory is available in total, just fragmented
> into smaller blocks (or does Torque take care of memory fragmentation?)

Torque is a batch system.  The underlying OS (typically Linux) is
responsible for memory management.

>>
>>
>>> With 20 or more jobs, I get the memory message. I assumed Torque
>>> would only start a job if the resources are available; is that a
>>> misconception?
>>>
>>>
>>>> By "crashing a node of the cluster", I
>>>> suspect you mean that the machine becomes unreachable; this is often
>>>> due to the machine swapping large blocks of memory (again, a memory
>>>> issue in user code).
>>> I cannot tell more precisely; the admin just told me he had to
>>> reboot this node. Before that, the entire queue-handling of Torque
>>> seemed to have come to a halt.
>>>
>>>> The scripts will run fine when enough memory is
>>>> available.  So, to deal with your problem, monitor memory usage on
>>>> running jobs and follow good programming policies regarding memory
>>>> usage.
>>> If that means being frugal (removing unused objects and
>>> preallocating matrices), I've tried my best. Adding some calls to
>>> gc() seemed to improve the situation only slightly.
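A minimal sketch of what "preallocation" and "removing unused objects" look
like in R; the sizes and the per-row work are just placeholders:

    n <- 1e4

    ## preallocate the result matrix once, instead of growing it with
    ## rbind() inside the loop (which copies the whole object every time)
    res <- matrix(NA_real_, nrow = n, ncol = 3)
    for (i in seq_len(n)) {
        res[i, ] <- runif(3)          # stand-in for the real per-row work
    }

    ## drop large intermediates explicitly as soon as they are not needed;
    ## an explicit gc() afterwards is optional, R collects on demand anyway
    tmp <- rnorm(1e6)                 # placeholder for a big temporary
    ## ... use tmp ...
    rm(tmp)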
>>>
>>
>> R does gc automatically when it's running out of memory, so that makes
>> no real difference. Sometimes it's useful to code in local scope so
>> objects can be collected automatically, but that's all very
>> application-specific.
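A minimal sketch of that local-scope idea (the file name is a placeholder):
temporaries created inside local() - or inside a function - go out of scope
when the block returns, so only the returned value stays alive and the rest
becomes collectable.

    summary_stats <- local({
        raw <- read.table("big_input.txt")   # large temporary, placeholder file
        colMeans(raw)                        # only this small result is kept
    })                                       # 'raw' is collectable after this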
>>
>>
>>>> Request larger memory resources if that is an option.  It is
>>>> possible that R has a memory leak, but it is rather unlikely this is
>>>> the problem. If you still have issues, you may want to provide some error messages
>>>> and some sessionInfo() as well as some measure of memory usage.
>>>
>>> For memory issues, the message above is thrown. For other jobs, the
>>> process just terminates without any further output, right after having
>>> read some large input files.
>>> I agree that this is unlikely to be an R memory leak; however, I am
>>> trying to find out what I can still do from my side, or whether I can
>>> point the admin at some Torque configuration problems, which is what I
>>> suspect.
>>> Has anyone observed similar behaviour and knows of a fix?
>>>
>>
>> It's very easy to run out of memory with parallel jobs. In particular,
>> if you don't share data across the jobs, you'll end up using a lot of
>> memory. People underestimate that aspect even though the math is
>> simple - if you have, say, 128 GB of RAM, which sounds like a lot, but
>> run 40 jobs, you end up with only ~3 GB per job, which is likely not
>> enough (at least not for the jobs I'm running ;)). Note that things
>> like parsing an input file can use quite a bit of memory - it's
>> usually a good idea to run a pre-processing step that parses the raw
>> input files into binary objects or RData files which can be loaded
>> much more efficiently.
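A minimal sketch of such a pre-processing step (the "input" directory and
file names are made up):

    ## one-off pre-processing run, done once and serially: parse each text
    ## file and store the result as an .RData file
    for (f in list.files("input", pattern = "\\.txt$", full.names = TRUE)) {
        dat <- read.table(f, header = TRUE)
        save(dat, file = sub("\\.txt$", ".RData", f))
    }

    ## inside each worker job: loading the parsed binary object is much
    ## cheaper than re-parsing the text, both in time and in the temporary
    ## memory that parsing needs
    load("input/chunk01.RData")       # restores the object 'dat'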
>
> Thanks for this discussion - these are exactly the symptoms I
> experienced and could not make sense of (i.e. crashing R sessions on the
> cluster, hanging nodes which needed to be restarted to work again), as
> I assumed that Torque would protect the node from crashing due to
> excessive memory usage.

Some clusters do have something in place to try to do this, but it is
not a simple task to implement well since Torque is not really
"responsible" for memory management once a job is running.

> One point is mentioned here again and again: monitor memory usage. But
> is there an easy way to do this? Can I submit a script to Torque and get
> back a memory report in a log file, which I can analyse to get memory
> usage over time?

You will probably need to talk to your cluster admins, but on our
cluster I simply log in to a node and run "top".  Other clusters have
dedicated monitoring tools.  Finally, some clusters have configured a
job epilogue script that reports on job resource usage.  All of these
issues are best dealt with by talking to the cluster administrators,
since each cluster (even those running Torque) is unique in some
ways.
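If nothing is available on the cluster side, a crude self-monitoring option
is to have the R job log its own resident size at a few checkpoints (Linux
only, since it reads /proc; the log file name and checkpoint labels are
placeholders). A minimal sketch:

    log_mem <- function(tag, logfile = "memory_usage.log") {
        status <- readLines("/proc/self/status")
        rss    <- grep("^VmRSS:", status, value = TRUE)  # e.g. "VmRSS:  123456 kB"
        cat(format(Sys.time()), tag, rss, "\n",
            file = logfile, append = TRUE)
    }

    log_mem("after reading input")
    ## ... heavy computation ...
    log_mem("after main computation")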

Sean

> Rainer
>
>
>>
>> Anyway, first run just one job and watch its memory usage to see how
>> it works. Linux typically cannot reclaim much memory back from a
>> running process, so when the job is done you should see roughly its
>> peak physical memory footprint.
>>
>>
>>> Thanks in advance,
>>> Till
>>>
>>>
>>> R version 2.12.1 (2010-12-16)
>>
>> Geeez... I didn't know such ancient versions still existed in the wild =)
>>
>> Cheers,
>> Simon
>>
>>
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] graphics  grDevices datasets  stats     utils     methods   base
>>>
>>> other attached packages:
>>> [1] Rmpi_0.5-9
>>>
>>>
>>>
>>>
>>> --
>>> Created with Opera's revolutionary e-mail module: http://www.opera.com/mail/
>>>
>>
>
>
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)
>
> Centre of Excellence for Invasion Biology
> Stellenbosch University
> South Africa
>
> Tel :       +33 - (0)9 53 10 27 44
> Cell:       +33 - (0)6 85 62 59 98
> Fax :       +33 - (0)9 58 10 27 44
>
> Fax (D):    +49 - (0)3 21 21 25 22 44
>
> email:      Rainer at krugs.de
>
> Skype:      RMkrug
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc


