[R-sig-hpc] Problems with Rmpi, openMPI and SEG in rocks linux
Marco Chiarandini
marco at imada.sdu.dk
Wed Oct 20 22:03:09 CEST 2010
Dear list,
I am trying to launch an Rmpi job under an SGE system and I am seeing some unwanted behaviour. Up to a certain number of requested slots the procedure works; beyond that number I get a segmentation fault in Rslaves.sh while spawning the slaves.
In more detail: I am using a Rocks Linux cluster with the SGE queueing system. Take the classical example:
http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
I put that in a file called provaMPI.R, add as its first line:
#!/usr/bin/Rscript
and modify it to call mpi.spawn.Rslaves(20).
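For reference, a minimal sketch of what provaMPI.R looks like after the change (this follows the general structure of the ACMMaC sample linked above, not its exact body):

```r
#!/usr/bin/Rscript
# Sketch of provaMPI.R, assuming the structure of the ACMMaC sample.
library(Rmpi)

mpi.spawn.Rslaves(nslaves = 20)   # the modified line: spawn 20 slaves

# Each slave reports its rank, the communicator size and its host name.
mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size(),
                      "running on", Sys.info()[["nodename"]]))

mpi.close.Rslaves()
mpi.quit()
```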
Then I write the bash script (script.sh) to send to qsub:
#!/bin/sh
# Run using bash
#$ -S /bin/bash
#$ -N provaMPI.R
#$ -pe mpi 21
#$ -cwd
/opt/openmpi/bin/orterun -np 1 provaMPI.R
And finally send:
shell$ qsub script.sh
The cluster is set up to run up to 12 processes on each node. I expect these slots to be filled greedily on the nodes selected by the queueing system, since that is the policy ($fill_up) in all three Parallel Environments present on the cluster (mpi, mpich, lam). This is indeed the case if I use, for example, 12 slaves and request -pe mpi 13: the run executes fine and all output is as expected. But with 20 slaves I obtain the segmentation fault.
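As a sanity check on the allocation, the slots actually granted by SGE can be read from the pe_hostfile that the scheduler writes (its path is exported as $PE_HOSTFILE inside the job). A small sketch, using a made-up two-node sample file in place of the real $PE_HOSTFILE:

```shell
#!/bin/sh
# The SGE pe_hostfile has one line per host:
#   hostname  slots  queue@hostname  processor-range
# Build a hypothetical sample (inside a real job, read "$PE_HOSTFILE" instead).
cat > /tmp/pe_hostfile.sample <<'EOF'
compute-1-14 12 medium1@compute-1-14 UNDEFINED
compute-1-2 9 medium1@compute-1-2 UNDEFINED
EOF

# Sum the slot counts (column 2): should match the -pe request (here 21).
awk '{ total += $2 } END { print total }' /tmp/pe_hostfile.sample
```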
I report below the error messages together with the three Parallel Environments. I experience the same problems with all of them (with lam, additionally, the executables lamhalt and lamboot are not found).
I have not been able to establish a link between the number of slots requested and the crashes. I once saw a successful run when requesting 12 slaves and 13 slots, but I find it hard to reproduce. Among the successful runs I also observed cases in which slots were allocated on more than one node. Overall the behaviour seems quite random, with a strong bias towards the unsuccessful cases :(
Any suggestion about a resolution of this issue would be very much appreciated.
Thank you,
Best regards,
Marco
$ qconf -sp lam
pe_name lam
slots 128
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startlam.sh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stoplam.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
[stuetzle at submit-1-0 ACOTSP]$ qconf -sp mpi
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
[stuetzle at submit-1-0 ACOTSP]$ qconf -sp mpich
pe_name mpich
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
In provaMPI.R.###### I find:
/usr/lib/R/library/Rmpi/Rslaves.sh: line 20: 19442 Segmentation fault (core dumped) $R_HOME/bin/R --no-init-file --slave --no-save < $1 > $hn.$2.$$.log 2>&1
--------------------------------------------------------------------------
orterun has exited due to process rank 8 with PID 19435 on
node compute-1-2.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).
--------------------------------------------------------------------------
rm: cannot remove `/tmp/2840522.1.medium1/rsh': No such file or directory
/opt/gridengine/default/spool/compute-1-14/active_jobs/2840522.1/pe_hostfile
compute-1-14
compute-1-14
compute-1-14
compute-1-14
compute-1-14
In the log files, named after each node, I find:
*** caught segfault ***
address 0x9ff44d4, cause 'memory not mapped'
Traceback:
1: .Call("mpi_initialize", PACKAGE = "Rmpi")
2: f(libname, pkgname)
3: firstlib(which.lib.loc, package)
4: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
7: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste(prefix, "\n ", sep = "") } else prefix <- "Error : " msg <- paste(prefix, conditionMessage(e), "\n", sep = "") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error"))})
8: try(firstlib(which.lib.loc, package))
9: library(Rmpi, logical.return = TRUE)
aborting ...
Then all the other nodes show:
Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal
--
Marco Chiarandini, PhD
Department of Mathematics and Computer Science,
University of Southern Denmark
Campusvej 55, DK-5230 Odense M, Denmark
marco at imada.sdu.dk, http://www.imada.sdu.dk/~marco
Phone: +45 6550 4031, Fax: +45 6550 2325