[R-sig-hpc] Unreproducible crashes of R-instances on cluster running Torque
Till Francke
win at comets.de
Thu May 2 15:46:21 CEST 2013
Dear Sean,
thanks for your suggestions in spite of my obscure descriptions. I'll try
to clarify some points:
> R messages about "memory allocation problems"
> usually mean that your code is asking for more memory than is
> available on the machine.
I get things like
Error: cannot allocate vector of size 304.6 Mb
However, the jobs are started with the Torque option
#PBS -l mem=3gb
When I submit this job alone, everything works like a charm, so 3 GB seems
to suffice, right? With 20 or more jobs, I get the memory message. I
assumed Torque would only start a job if the resources are available; is
that a misconception?
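Whether the mem request is actually enforced on the node depends on the
Torque configuration, which I cannot see from my side. One way to make a
job fail early instead of pushing the node into swap would be to cap its
memory inside the job script. A minimal sketch, assuming a shell whose
ulimit supports -v (bash and dash do); mem_to_kb and run_capped are
made-up helper names, not Torque commands, and only lowercase integer
sizes such as "3gb" are handled:

```shell
#!/bin/sh
#PBS -l mem=3gb

# Convert a Torque-style size ("3gb", "512mb", "2048kb") to kB.
# Hypothetical helper; integer values with lowercase units only.
mem_to_kb() {
    case "$1" in
        *gb) echo $(( ${1%gb} * 1024 * 1024 )) ;;
        *mb) echo $(( ${1%mb} * 1024 )) ;;
        *kb) echo "${1%kb}" ;;
        *)   echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# Run a command with its virtual memory capped at the given size.
# The subshell keeps the limit from affecting the calling shell.
run_capped() {
    limit_kb=$(mem_to_kb "$1") || return 1
    shift
    ( ulimit -v "$limit_kb" && "$@" )
}
```

In the job script this would be called as, for example,
run_capped 3gb Rscript my_script.R (my_script.R is a placeholder). Note
that ulimit -v limits virtual address space, which can be larger than the
memory a process really uses, so the cap may need to be more generous
than the mem request.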
> By "crashing a node of the cluster", I
> suspect you mean that the machine becomes unreachable; this is often
> due to the machine swapping large blocks of memory (again, a memory
> issue in user code).
I cannot tell more precisely; the admin just told me he had to reboot this
node. Before that, Torque's entire queue handling seemed to have come to a
halt.
> The scripts will run fine when enough memory is
> available. So, to deal with your problem, monitor memory usage on
> running jobs and follow good programming policies regarding memory
> usage.
If that means being frugal, removing unused objects, and preallocating
matrices, I have tried my best. Adding some calls to gc() seemed to improve
the situation only slightly.
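For monitoring memory usage on running jobs, the peak resident memory of
a process can be read from the kernel. A minimal sketch, assuming Linux
nodes where /proc/<pid>/status is readable; peak_rss_kb is a made-up
helper name:

```shell
#!/bin/sh
# Print the peak resident set size (VmHWM, in kB) of a process.
# Linux-only: relies on /proc/<pid>/status.
# With no argument, the calling shell's own PID is used.
peak_rss_kb() {
    pid="${1:-$$}"
    awk '/^VmHWM:/ { print $2 }' "/proc/$pid/status"
}
```

Calling peak_rss_kb with the PID of the R process at intervals from the
job script, and appending the result to a log file, would show how close
each job gets to its 3 GB request before it fails.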
> Request larger memory resources if that is an option. It is
> possible that R has a memory leak, but it is rather unlikely this is
> the problem. If you still have issues, you may want to provide some
> error messages
> and some sessionInfo() as well as some measure of memory usage.
For the memory issue, the message above is thrown. For other jobs, the
process just terminates without any further output, right after having read
some large input files.
I agree that an R memory leak is unlikely; however, I am trying to find out
what I can still do on my side, or whether I can point the admin to some
Torque configuration problems, which is what I suspect.
Has anyone observed similar behaviour and know of a fix?
Thanks in advance,
Till
R version 2.12.1 (2010-12-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] graphics grDevices datasets stats utils methods base
other attached packages:
[1] Rmpi_0.5-9