[R-sig-hpc] Unreproducible crashes of R instances on cluster running Torque
Rainer M. Krug
Rainer at krugs.de
Mon May 13 10:08:34 CEST 2013
Simon Urbanek <simon.urbanek at r-project.org> writes:
> On May 2, 2013, at 9:46 AM, Till Francke wrote:
>> Dear Sean,
>> thanks for your suggestions in spite of my obscure descriptions. I'll try to clarify some points:
>>> R messages about "memory allocation problems"
>>> usually mean that your code is asking for more memory than is
>>> available on the machine.
>> I get things like
>> Error: cannot allocate vector of size 304.6 Mb
>> However, the jobs are started with the Torque option
>> #PBS -l mem=3gb
>> When I submit this job alone, everything works like a charm, so 3 gb seem to suffice, right?
> No, the 300MB are *in addition* to all the other memory already allocated
> by R - probably very close to the 3Gb. Also note that mem is the total
> memory of the whole job, not per process, so some processes may get very
> little (I don't use Torque, though, so this is just based on the docs).
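Based on the Torque docs, that distinction could be sketched roughly as follows (a hypothetical job script: the resource names `mem` and `pmem` are standard Torque/PBS, everything else - node counts, times, the script name - is illustrative):

```shell
#!/bin/bash
## Hypothetical Torque job script. Per the Torque docs, "mem" caps the
## *total* memory of the whole job, while "pmem" caps memory *per process*,
## so for multi-process jobs pmem is usually what you mean.
#PBS -l nodes=1:ppn=4
#PBS -l pmem=3gb          # 3 GB for *each* of the 4 processes
#PBS -l walltime=02:00:00

cd "$PBS_O_WORKDIR"
R CMD BATCH myscript.R    # myscript.R is a placeholder
```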
If I remember correctly, memory fragmentation plays an important role
for R (still in version 3.0.0?): a single contiguous block of memory
needs to be available for each vector, so one can get these error
messages even if enough memory is free in total but it is fragmented
into smaller blocks (or does Torque take care of memory fragmentation?).
>> With 20 or more jobs, I get the memory message. I assumed Torque
>> would only start a job if the resources are available - is that a
>> wrong assumption?
>>> By "crashing a node of the cluster", I
>>> suspect you mean that the machine becomes unreachable; this is often
>>> due to the machine swapping large blocks of memory (again, a memory
>>> issue in user code).
>> I cannot tell more precisely; the admin just told me he had to
>> reboot this node. Before that, the entire queue-handling of Torque
>> seemed to have come to a halt.
>>> The scripts will run fine when enough memory is
>>> available. So, to deal with your problem, monitor memory usage on
>>> running jobs and follow good programming policies regarding memory
>> If that means being frugal - removing unused objects and
>> preallocating matrices - I've tried my best. Adding some calls to
>> gc() seemed to improve the situation only slightly.
> R does gc automatically when it's running out of memory, so that makes
> no real difference. Sometimes it's useful to code in local scope so
> objects can be collected automatically, but that's all very marginal.
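The "local scope" idea above can be sketched like this (the input file name is made up): temporaries created inside `local()` are never bound in the global environment, so they become collectable as soon as the block finishes.

```r
## Sketch: intermediate objects created inside local() go out of scope
## when the block ends, so the garbage collector can reclaim them.
result <- local({
  tmp <- read.table("big_input.txt", header = TRUE)  # large intermediate
  colSums(tmp[sapply(tmp, is.numeric)])              # keep only the summary
})
## Only 'result' survives here; 'tmp' is gone and collectable.
```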
>>> Request larger memory resources if that is an option. It is
>>> possible that R has a memory leak, but it is rather unlikely this is
>>> the problem. If you still have issues, you may want to provide some error messages
>>> and some sessionInfo() as well as some measure of memory usage.
>> For the memory issue, the message above is thrown. For other jobs, the
>> process just terminates without any further output, right after having
>> read some large input files.
>> I agree that this is unlikely to be an R memory leak; however, I am
>> trying to find out what I can still do from my side, or whether I can
>> point the admin at some Torque configuration problem, which is what I
>> currently suspect.
>> Has anyone observed similar behaviour and knows a fix?
> It's very easy to run out of memory with parallel jobs. In particular,
> if you don't share data across the jobs, you'll end up using a lot of
> memory. People underestimate that aspect even though the math is
> simple - if you have, let's say, 128GB of RAM, which sounds like a lot,
> but run 40 jobs, you'll end up with only ~3Gb per job, which is likely
> not enough (at least not for the jobs I'm running ;)). Note that things
> like parsing an input file can use quite a bit of memory - it's
> usually a good idea to run a pre-processing step that parses the raw
> files into binary objects or RData files which can be loaded much more
> efficiently.
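The pre-processing step described above could look roughly like this (directory and file names are made up): parse each text file once, store it as a binary RData file, and let every job load that instead of re-parsing.

```r
## One-off pre-processing: parse each raw text file once and save the
## result as a binary RData file; loading it later is much faster and
## has a far smaller peak-memory footprint than re-parsing the text.
for (f in list.files("raw", pattern = "\\.txt$", full.names = TRUE)) {
  dat <- read.table(f, header = TRUE)            # expensive parse, done once
  save(dat, file = sub("\\.txt$", ".RData", f))  # compact binary copy
}
## Inside each job:
## load("raw/input_01.RData")   # restores 'dat' (hypothetical file name)
```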
Thanks for this discussion - these are exactly the symptoms I
experienced and could not make sense of (i.e. crashing R sessions on the
cluster, hanging nodes which needed to be restarted to work again), as
I had assumed that Torque would protect the node from crashing due to
excessive memory usage.
One point is mentioned here again and again: monitor memory usage. But
is there an easy way to do this? Can I submit a script to torque and get
back a memory report in a log file, which I can analyse to get memory
usage over time?
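One simple, Linux-specific way to get such a log from inside the job itself (a sketch assuming /proc is available on the nodes; the log file name is made up) is to read the process's own /proc status at regular points in the script:

```r
## Sketch: append the current and peak resident memory of this R process
## to a log file, which can be analysed afterwards for usage over time.
log_mem <- function(logfile = "mem_usage.log") {
  status <- readLines("/proc/self/status")          # Linux-specific
  vm <- grep("^Vm(RSS|Peak)", status, value = TRUE) # current + peak RSS
  cat(format(Sys.time()), vm, "\n", file = logfile, append = TRUE)
}
## Call log_mem() before and after memory-heavy steps in the script.
```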
> Anyway, first run just one job and watch its memory usage to see how
> it behaves. Linux typically cannot reclaim much memory back from R, so
> when the job is done you should see roughly its peak physical memory
> footprint.
>> Thanks in advance,
>> R version 2.12.1 (2010-12-16)
> Geeez... I didn't know such ancient versions still existed in the wild =)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> locale:
>>  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>  LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>>  LC_PAPER=en_US.UTF-8 LC_NAME=C
>>  LC_ADDRESS=C LC_TELEPHONE=C
>>  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> attached base packages:
>>  graphics grDevices datasets stats utils methods base
>> other attached packages:
>>  Rmpi_0.5-9
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)
Centre of Excellence for Invasion Biology
Tel : +33 - (0)9 53 10 27 44
Cell: +33 - (0)6 85 62 59 98
Fax : +33 - (0)9 58 10 27 44
Fax (D): +49 - (0)3 21 21 25 22 44
email: Rainer at krugs.de