[R-sig-hpc] Unreproducible crashes of R-instances on cluster running Torque
Till Francke
win at comets.de
Thu May 2 11:14:16 CEST 2013
Dear List,
I am a user of a Linux cluster running Torque.
I want to run "embarrassingly parallel" R jobs (no worker interaction,
no MPI/multicore, just simple replicates of a script with different
arguments). Whenever I submit more than ~30 of these, I encounter
problems: some jobs run fine, others terminate with R error messages about
memory allocation, or even end without any further output, sometimes
crashing a node of the cluster. Each of these scripts runs fine when
started alone.
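
A minimal sketch of the submission pattern described above, for
illustration only. The names run_replicate.sh, replicate.R and REP_ARG are
hypothetical placeholders (the actual scripts are not part of this post),
and the loop defaults to a dry run that only prints the qsub commands
instead of submitting them:

```shell
#!/bin/sh
# Sketch: submit N independent replicates of one R script to Torque.
# Assumption: a job script run_replicate.sh (hypothetical) exists and
# contains something like:  Rscript replicate.R "$REP_ARG"

N_JOBS="${N_JOBS:-40}"    # more than the ~30 jobs at which problems appear
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really submit

# Build the qsub command line for one replicate.
# -N sets the job name, -v passes a variable into the job's environment.
build_qsub_cmd() {
    printf 'qsub -N rep_%s -v REP_ARG=%s run_replicate.sh\n' "$1" "$1"
}

i=1
while [ "$i" -le "$N_JOBS" ]; do
    cmd=$(build_qsub_cmd "$i")
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"       # show what would be submitted
    else
        $cmd              # hand the job to Torque
    fi
    i=$((i + 1))
done
```

Each job is independent of the others; the only thing that differs
between them is the argument passed in.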
My admin suggests this is a memory leak in R. However, even if that were
the case, I wonder whether it should be able to stall the cluster.
Could anyone give me some advice on how to address this, please?
Thanks,
Till
Scientific Linux SL release 5.5 (Boron)
Linux head 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:26:03 EST 2013 x86_64
x86_64 x86_64 GNU/Linux
R version 2.12.1 (2010-12-16)