[R-sig-hpc] Unreproducable crashes of R-instances on cluster running Torque

Thu May 2 16:22:08 CEST 2013

Hi, Till.

See below.

On Thu, May 2, 2013 at 9:46 AM, Till Francke <win at comets.de> wrote:
> Dear Sean,
> thanks for your suggestions in spite of my obscure descriptions. I'll try to
> clarify some points:
>
>
>> R messages about "memory allocation problems"
>> usually mean that your code is asking for more memory than is
>> available on the machine.
>
> I get things like
>         Error: cannot allocate vector of size 304.6 Mb
> However, the jobs are started with the Torque option
>         #PBS -l mem=3gb
> When I submit this job alone, everything works like a charm, so 3 gb seems to
> suffice, right? With 20 or more jobs, I get the memory message. I assumed
> Torque would only start a job if the resources are available, is that a
> misconception?

Torque will start a job if it THINKS there is memory available.  If
you have told Torque that your job needs 3gb and it actually uses 6gb,
Torque will (typically) not know that.  If a node has 16gb of RAM,
Torque may place five 3gb jobs on it; if each of them really uses 6gb,
the node needs 30gb and runs out of memory.  What you are describing is
therefore consistent with jobs not having enough memory: "cannot
allocate vector..." is R's out-of-memory error.
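
To make the arithmetic concrete, here is a small sketch using the
numbers from the example above.  The 6gb figure is only my
illustration of a job that uses more than it requested, not something
measured on your cluster:

```shell
#!/bin/sh
# Illustrative numbers only, taken from the example in this message.
node_ram_gb=16     # physical RAM on one node
requested_gb=3     # what each job tells Torque (-l mem=3gb)
actual_gb=6        # what each job really uses (assumed for illustration)

# Number of jobs that fit on the node according to the requests.
jobs=$((node_ram_gb / requested_gb))
reserved_gb=$((jobs * requested_gb))
used_gb=$((jobs * actual_gb))

echo "jobs placed on the node:    $jobs"
echo "memory Torque accounts for: ${reserved_gb}gb of ${node_ram_gb}gb"
echo "memory actually needed:     ${used_gb}gb"
if [ "$used_gb" -gt "$node_ram_gb" ]; then
    echo "node oversubscribed by $((used_gb - node_ram_gb))gb"
fi
```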

If you can ssh into the nodes while jobs are running, you can run
"top" to see memory usage for each process.  If you cannot do so,
double the mem request anyway.
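
If you prefer a one-off number to watching "top", something like the
following sketch sums the resident memory of your R processes on a
node.  The helper name r_rss_mb is made up for this example, and I am
assuming the worker processes show up under the command name "R",
which may differ on your setup:

```shell
#!/bin/sh
# Read "pid rss comm" lines on stdin (rss in kB, as printed by ps) and
# report how many processes are named "R" and their total resident memory.
r_rss_mb() {
    awk '$3 == "R" { kb += $2; n++ }
         END { printf "%d R processes, %.0f MB resident\n", n, kb / 1024 }'
}

# Usage on a compute node (the trailing "=" suppresses the ps header):
#   ps -u "$(id -un)" -o pid=,rss=,comm= | r_rss_mb
```

Running this a few times while a job is reading its input files
should show whether a single job stays below the 3gb you requested.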

>
>
>> By "crashing a node of the cluster", I
>> suspect you mean that the machine becomes unreachable; this is often
>> due to the machine swapping large blocks of memory (again, a memory
>> issue in user code).
>
> I cannot tell more precisely; the admin just told me he had to reboot this
> node. Before that, the entire queue-handling of Torque seemed to have come
> to a halt.

I'm not sure how hanging a node would halt an entire Torque cluster
unless the scheduler is running on a worker node (generally not a good
idea, but sometimes necessary to reduce cost).  However, having R hang
a node is a relatively common occurrence on clusters with limited node
memory relative to typical workloads.  I suspect that the two memory
issues are related.  Again, I'd monitor memory usage in running
processes so that your request matches what the jobs really use.  For a
shortcut, simply double your Torque memory request to see if the issue
is resolved.
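
In the job script, the doubled request would look roughly like this.
Only the mem line follows from your description; the script name and
the way R is launched are placeholders, since I don't know how you
start your Rmpi jobs:

```shell
#!/bin/sh
# Doubled memory request (was: #PBS -l mem=3gb)
#PBS -l mem=6gb

# Torque sets PBS_O_WORKDIR to the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Placeholder launch line; my_analysis.R is a hypothetical script name.
Rscript my_analysis.R
```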

>
>> The scripts will run fine when enough memory is
>> available.  So, to deal with your problem, monitor memory usage on
>> running jobs and follow good programming policies regarding memory
>> usage.
>
> If that means being frugal, removing unused objects and preallocating
> matrices, I've tried my best. Adding some calls to gc() seemed to improve the
> situation only slightly.

Yes, you'll need to remove unused objects explicitly (using rm());
gc() can only free memory that is no longer referenced by any object.
At the end of the day, though, you may just need more resources, as I
noted above.

>
>> Request larger memory resources if that is an option.  It is
>> possible that R has a memory leak, but it is rather unlikely this is
>> the problem. If you still have issues, you may want to provide some error
>> messages
>> and some sessionInfo() as well as some measure of memory usage.
>
>
> For the memory issue, the message above is thrown. For other jobs, the process
> just terminates without any further output right after having read some large
> input files.

You (or your admin) should have logs from the cluster that might be useful.
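
While a job is still known to the server, "qstat -f" reports both the
memory you requested and the memory used so far.  A rough sketch for
pulling out those two fields follows; the helper name job_mem and the
job id are made up, and the exact set of fields may vary with your
Torque version:

```shell
#!/bin/sh
# Read "qstat -f <jobid>" output on stdin and print the requested memory
# (Resource_List.mem) next to the memory used so far (resources_used.mem).
job_mem() {
    awk -F' = ' '
        /Resource_List\.mem/  { req = $2 }
        /resources_used\.mem/ { used = $2 }
        END { print "requested: " req; print "used: " used }'
}

# Usage (12345 is a placeholder job id):
#   qstat -f 12345 | job_mem
```

If "used" gets close to or beyond "requested" shortly before a job
dies, that points to memory rather than to the Torque configuration.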

> I agree that this is unlikely to be an R memory leak; however, I am trying to
> find out what I can still do from my side, or whether I can point the admin at
> some Torque configuration problems, which is what I suspect.
> Has anyone observed similar behaviour and know of a fix?

I do not really suspect Torque configuration problems though I cannot
rule them out.  "Crashing" a node on the cluster by trying to allocate
large blocks of memory and then swapping is, in my experience, a
not-too-uncommon event.

> Thanks in advance,
> Till
>
>
> R version 2.12.1 (2010-12-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)

This is unrelated, but you should get your admin to update to a newer
version of R.  This version is 2+ years old.

Sean

> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] graphics  grDevices datasets  stats     utils     methods   base
>
> other attached packages:
> [1] Rmpi_0.5-9
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc