[R-sig-hpc] Problems with Rmpi, openMPI and SEG in rocks linux
Marco Chiarandini
marco at imada.sdu.dk
Wed Oct 20 22:03:09 CEST 2010
Dear list,
I am trying to launch an Rmpi job under an SGE system and I am seeing some unwanted behaviour. Up to a certain number of requested slots the procedure works; beyond that number I get a segmentation fault in Rslaves.sh while spawning the slaves.
In more detail: I am using a Rocks Linux cluster with the SGE queueing system. Take the classical example:
http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
I put that in a file called provaMPI.R, add as its first line:
#!/usr/bin/Rscript
and modify it to call mpi.spawn.Rslaves(20).
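For reference, a minimal sketch of what provaMPI.R looks like after the change (this follows the general structure of the ACMMaC sample linked above, not its exact body):

```r
#!/usr/bin/Rscript
# Sketch of provaMPI.R, assuming the structure of the ACMMaC sample.
library(Rmpi)

mpi.spawn.Rslaves(nslaves = 20)   # the modified line: spawn 20 slaves

# Each slave reports its rank, the communicator size and its host name.
mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size(),
                      "running on", Sys.info()[["nodename"]]))

mpi.close.Rslaves()
mpi.quit()
```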
Then I write the bash script (script.sh) to send to qsub:
#!/bin/sh
# Run using bash
#$ -S /bin/bash
#$ -N provaMPI.R
#$ -pe mpi 21
#$ -cwd
/opt/openmpi/bin/orterun -np 1 provaMPI.R
And finally send:
shell$ qsub script.sh
The cluster is set up to run up to 12 processes on each node. I expect these slots to be filled greedily on the nodes selected by the queueing system, since that is the policy ($fill_up) in all three Parallel Environments present on the cluster (mpi, mpich, lam). This is indeed the case if I use, for example, 12 slaves and request -pe mpi 13: the run executes fine and all output is as expected. But with 20 slaves I obtain the segmentation fault.
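As a sanity check on the allocation, the slots actually granted by SGE can be read from the pe_hostfile that the scheduler writes (its path is exported as $PE_HOSTFILE inside the job). A small sketch, using a made-up two-node sample file in place of the real $PE_HOSTFILE:

```shell
#!/bin/sh
# The SGE pe_hostfile has one line per host:
#   hostname  slots  queue@hostname  processor-range
# Build a hypothetical sample (inside a real job, read "$PE_HOSTFILE" instead).
cat > /tmp/pe_hostfile.sample <<'EOF'
compute-1-14 12 medium1@compute-1-14 UNDEFINED
compute-1-2 9 medium1@compute-1-2 UNDEFINED
EOF

# Sum the slot counts (column 2): should match the -pe request (here 21).
awk '{ total += $2 } END { print total }' /tmp/pe_hostfile.sample
```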
I report below the error messages together with the three Parallel Environments. I experience the same problems with all of them (with lam, additionally, the executables lamhalt and lamboot are not found).
I have not been able to establish a link between the number of slots requested and the crashes. I once saw a successful run when requesting 12 slaves and 13 slots, but I find it hard to reproduce. Among the successful runs I also observed cases in which slots were allocated on more than one node. Overall the behaviour seems quite random, with a strong bias towards the unsuccessful cases :(
Any suggestion about a resolution of this issue would be very much appreciated.
Thank you,
Best regards,
Marco
$ qconf -sp lam
pe_name lam
slots 128
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startlam.sh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stoplam.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
[stuetzle at submit-1-0 ACOTSP]$ qconf -sp mpi
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
[stuetzle at submit-1-0 ACOTSP]$ qconf -sp mpich
pe_name mpich
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
In provaMPI.R.###### I find:
/usr/lib/R/library/Rmpi/Rslaves.sh: line 20: 19442 Segmentation fault (core dumped) $R_HOME/bin/R --no-init-file --slave --no-save < $1 > $hn.$2.$$.log 2>&1
--------------------------------------------------------------------------
orterun has exited due to process rank 8 with PID 19435 on
node compute-1-2.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).
--------------------------------------------------------------------------
rm: cannot remove `/tmp/2840522.1.medium1/rsh': No such file or directory
/opt/gridengine/default/spool/compute-1-14/active_jobs/2840522.1/pe_hostfile
compute-1-14
compute-1-14
compute-1-14
compute-1-14
compute-1-14
In the log files, named after each node, I find:
*** caught segfault ***
address 0x9ff44d4, cause 'memory not mapped'
Traceback:
1: .Call("mpi_initialize", PACKAGE = "Rmpi")
2: f(libname, pkgname)
3: firstlib(which.lib.loc, package)
4: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
7: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste(prefix, "\n ", sep = "") } else prefix <- "Error : " msg <- paste(prefix, conditionMessage(e), "\n", sep = "") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error"))})
8: try(firstlib(which.lib.loc, package))
9: library(Rmpi, logical.return = TRUE)
aborting ...
Then all the other nodes show:
Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal
--
Marco Chiarandini, PhD
Department of Mathematics and Computer Science,
University of Southern Denmark
Campusvej 55, DK-5230 Odense M, Denmark
marco at imada.sdu.dk, http://www.imada.sdu.dk/~marco
Phone: +45 6550 4031, Fax: +45 6550 2325