[R-sig-hpc] Unreproducible crashes of R instances on cluster running Torque

Sean Davis sdavis2 at mail.nih.gov
Thu May 2 13:05:27 CEST 2013


On Thu, May 2, 2013 at 5:14 AM, Till Francke <win at comets.de> wrote:
> Dear List,
> I am a user of a Linux cluster running Torque.
> I want to run very "embarrassingly parallel" R jobs (no worker interaction,
> no MPI/multicore, just simple replicates of a script with different
> arguments). Whenever I submit more than ~30 of these, I encounter problems:
> some jobs run fine, others terminate with R messages about memory allocation
> problems or finish without further output, sometimes crashing a node
> of the cluster. Each of these scripts runs fine when started alone.
> My admin suggests this is a memory leak in R; however, even if that were
> the case, I wonder whether it should stall the cluster.
> Could anyone give me some advice on how to address this, please?

Hi, Till.

You describe several problems rather vaguely, but I would suspect that
your problems are related to the memory use of your own code and not to
a memory leak in R.  R messages about "memory allocation problems"
usually mean that your code is asking for more memory than is available
on the machine.  By "crashing a node of the cluster", I suspect you mean
that the machine becomes unreachable; this is often due to the machine
swapping large blocks of memory (again, a memory issue in user code).
A single script has the whole node's memory to itself, but once ~30
jobs are submitted, several may land on the same node, and their
combined memory use can exceed what that node has; that would explain
why the scripts run fine alone but not in bulk.  So, to deal with your
problem, monitor memory usage on your running jobs and follow good
programming practices regarding memory usage.  Request larger memory
resources from the scheduler if that is an option at your site.  It is
possible that R has a memory leak, but it is rather unlikely that this
is the problem.
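
For the monitoring part, a sketch along these lines (base R only, drop
it into each script wherever you want a checkpoint; the details are just
a suggestion) will show how much memory the job actually uses and which
objects are responsible.  On the Torque side, depending on how your site
has it configured, a resource request such as "qsub -l mem=4gb,vmem=4gb"
should keep the scheduler from packing more jobs onto a node than its
memory allows.

    gc()                                  # memory currently and maximally used by this R process
    sizes <- sapply(ls(), function(nm) object.size(get(nm)))
    head(sort(sizes, decreasing = TRUE))  # largest objects in the workspace, in bytes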

If you still have issues, post the exact error messages, the output of
sessionInfo(), and some measure of memory usage along with your question.
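One simple way to capture that (not the only one) is to end each script
with explicit prints, so the information lands in the Torque output file
of every job:

    print(sessionInfo())   # R version, platform, loaded packages
    print(gc())            # memory used by this R process at the end of the run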

Sean


