[R-sig-hpc] Unreproducable crashes of R-instances on cluster running Torque

Rainer M. Krug Rainer at krugs.de
Mon May 13 10:08:34 CEST 2013

Simon Urbanek <simon.urbanek at r-project.org> writes:

> On May 2, 2013, at 9:46 AM, Till Francke wrote:
>> Dear Sean,
>> thanks for your suggestions in spite of my obscure descriptions. I'll try to clarify some points:
>>> R messages about "memory allocation problems"
>>> usually mean that your code is asking for more memory than is
>>> available on the machine.
>> I get things like
>> 	Error: cannot allocate vector of size 304.6 Mb
>> However, the jobs are started with the Torque option
>> 	#PBS -l mem=3gb
>> When I submit this job alone, everything works like a charm, so 3 GB seems to suffice, right?
> No, the 300 MB are *in addition* to all other memory allocated by R -
> probably very close to the 3 GB. Also note that mem is the total memory
> across all processes of the job, not per process, so some may get very
> little (I don't use Torque, though, so this is just based on the docs).
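
A minimal sketch of how the job-wide and the per-process limit could be written in a Torque job script. The resource names mem and pmem are the ones listed in the Torque documentation; the values and the nodes/ppn request are illustrative assumptions, and whether the limits are actually enforced depends on how the site has configured Torque and its scheduler.

```shell
#!/bin/bash
# Illustrative Torque header (values are assumptions, not taken from this thread).
#PBS -l nodes=1:ppn=4
#PBS -l mem=12gb     # total memory for all processes of the job together
#PBS -l pmem=3gb     # memory limit for each single process of the job

# PBS_JOBID is set by Torque inside a job; fall back to a label otherwise.
echo "job ${PBS_JOBID:-interactive} running on $(hostname)"
```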

If I remember correctly, memory fragmentation plays an important role
for R (still in version 3.0.0?): a vector needs one contiguous block of
memory, so one can get these error messages even if enough memory is
free in total but fragmented into smaller blocks (or does Torque take
care of memory fragmentation?)

>> With 20 or more jobs, I get the memory message. I assumed Torque
>> would only start a job if the resources are available, is that a
>> misconception?
>>> By "crashing a node of the cluster", I
>>> suspect you mean that the machine becomes unreachable; this is often
>>> due to the machine swapping large blocks of memory (again, a memory
>>> issue in user code).
>> I cannot tell more precisely; the admin just told me he had to
>> reboot this node. Before that, the entire queue-handling of Torque
>> seemed to have come to a halt.
>>> The scripts will run fine when enough memory is
>>> available.  So, to deal with your problem, monitor memory usage on
>>> running jobs and follow good programming policies regarding memory
>>> usage.
>> If that means being frugal, removing unused objects and
>> preallocation of matrices I've tried my best. Adding some calls to
>> gc() seemed to improve the situation only slightly.
> R does gc automatically when it's running out of memory, so that makes
> no real difference. Sometimes it's useful to code in local scope so
> objects can be collected automatically, but that's all very
> application-specific.
>>> Request larger memory resources if that is an option.  It is
>>> possible that R has a memory leak, but it is rather unlikely this is
>>> the problem. If you still have issues, you may want to provide some error messages
>>> and some sessionInfo() as well as some measure of memory usage.
>> For the memory issue, the message above is thrown. For other jobs, the
>> process just terminates without any more output just after having
>> read some large input files.
>> I agree that this is unlikely to be an R memory leak; however, I am
>> trying to find out what I can still do from my side or if I can point
>> the admin at some Torque configuration problems, which is what I
>> suspect.
>> Has anyone observed similar behaviour and knows a fix?
> It's very easy to run out of memory with parallel jobs. In particular
> if you don't share data across the jobs, you'll end up using a lot of
> memory. People underestimate that aspect even though the math is
> simple - if you have let's say 128GB of RAM which sounds like a lot,
> but run 40 jobs, you'll end up with only ~3 GB per job which is likely
> not enough (at least not the jobs I'm running ;)). Note that things
> like parsing an input file can use quite a bit of memory - it's
> usually a good idea to run a pre-processing step that parses random
> files into binary objects or RData files which can be loaded much more
> efficiently.
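
The arithmetic above can be checked with a few lines of shell. This is only a sketch: the helper names mem_per_job_mb and max_jobs are made up for illustration, and the calculation ignores the memory needed by the operating system and by other users on the node.

```shell
#!/bin/bash
# mem_per_job_mb TOTAL_GB NJOBS: integer MB of RAM per job if shared evenly
# (hypothetical helper, not part of Torque or R)
mem_per_job_mb() {
    local total_gb=$1 njobs=$2
    echo $(( total_gb * 1024 / njobs ))
}

# max_jobs TOTAL_GB PER_JOB_GB: how many jobs of a given size fit into RAM
# (hypothetical helper, integer division rounds down)
max_jobs() {
    local total_gb=$1 per_job_gb=$2
    echo $(( total_gb / per_job_gb ))
}

mem_per_job_mb 128 40   # prints 3276, i.e. about 3.2 GB per job
max_jobs 128 3          # prints 42
```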

Thanks for this discussion - because these are exactly the symptoms I
experienced and could not make sense of (i.e. crashing R sessions on the
cluster, hanging nodes which needed to be restarted to work again) - as
I assumed that Torque would protect the node from crashing due to too
much memory usage.

One point is mentioned here again and again: monitor memory usage. But
is there an easy way to do this? Can I submit a script to Torque and get
back a memory report in a log file, which I can analyse to get memory
usage over time?
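
One possible approach, as a rough sketch only: let the job script sample the memory of the R process with ps at a fixed interval and append the values to a log file. The function name log_mem, the default 10-second interval and the log file name are assumptions made for illustration. The function watches a single PID only, so child processes (e.g. slaves spawned via Rmpi) would need their own logger.

```shell
#!/bin/bash
# log_mem PID LOGFILE [INTERVAL]
# Appends "epoch_seconds rss_kb vsz_kb" to LOGFILE every INTERVAL seconds
# (default 10) for as long as the process PID exists.
# Hypothetical helper for illustration, not part of Torque or R.
log_mem() {
    local pid=$1 logfile=$2 interval=${3:-10} sample
    while kill -0 "$pid" 2>/dev/null; do
        # rss = resident memory, vsz = virtual memory, both in kB;
        # the trailing '=' suppresses the column headers
        sample=$(ps -o rss= -o vsz= -p "$pid")
        # skip the sample if the process vanished between kill and ps
        [ -n "$sample" ] && echo "$(date +%s) $sample" >> "$logfile"
        sleep "$interval"
    done
}

# Possible use inside a job script (myscript.R is a placeholder):
#   Rscript myscript.R &
#   r_pid=$!
#   log_mem "$r_pid" "mem_${PBS_JOBID}.log" 10 &
#   wait "$r_pid"
```

The resulting file has three columns and can be read with read.table() to plot memory usage over time. In addition, qstat -f on a running job should list resources_used.mem and resources_used.vmem as recorded by Torque itself, although I have not checked how often these values are updated.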


> Anyway, first run just one job and watch its memory usage to see how
> it works. Linux typically cannot reclaim much memory back, so when
> it's done you should see roughly the physical memory footprint.
>> Thanks in advance,
>> Till
>> R version 2.12.1 (2010-12-16)
> Geeez... I didn't know such ancient versions still existed in the wild =)
> Cheers,
> Simon
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> attached base packages:
>> [1] graphics  grDevices datasets  stats     utils     methods   base
>> other attached packages:
>> [1] Rmpi_0.5-9
