[R-sig-hpc] Error in makeMPIcluster(spec, ...): how to get a minimal example for parallel computing with doSNOW to run?

Marius Hofert m_hofert at web.de
Sun Dec 19 09:54:08 CET 2010


Dear all,

I received the following feedback from the maintainers of the cluster "Brutus". 

## begin quotation 

All MPI processes are launched by mpirun when your job starts. Rmpi simply queries
those processes to find out their number and ranks. In other words, the R
"cluster" is already up and running when Rmpi (or doMPI, doSNOW) starts. That is
the reason you cannot spawn slaves or create a new cluster using
mpi.spawn.Rslaves(), startMPIcluster() or makeCluster().

The proof is that the "hello world" example runs without error on Brutus if you
remove the mpi.spawn.Rslaves() statement.

I believe the problem in the present case is that registerDoMPI(cl) and
registerDoSNOW(cl) require a "cluster" (cl) as argument. Unlike Rmpi, they are
unable to use the pre-existing (and unnamed) cluster started by mpirun.

From my point of view, this is a problem between doMPI/doSNOW and Rmpi. It has
nothing to do with bsub or mpirun.

## end quotation
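
To make sure I understand this correctly: under mpirun there is nothing left to spawn, one can only query the processes that already exist. Here is a minimal sketch of what I take the maintainers to mean (this is not the script I submitted; mpi.comm.rank() and mpi.comm.size() are standard Rmpi calls, and comm 0 is MPI_COMM_WORLD):

## All R processes were already started by mpirun; we only ask MPI about them.
library(Rmpi)
cat("This is rank", mpi.comm.rank(0), "of", mpi.comm.size(0),
    "running on", Sys.info()[["nodename"]], "\n")
## In contrast, mpi.spawn.Rslaves() would try to create *additional* processes,
## which is exactly what is not available in this setup.
mpi.quit()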

Then I checked ?makeCluster again. I found an interesting paragraph under "Details":

## begin quotation

In MPI configurations where process spawning is not available and something like mpirun is used to start a master and a set of slaves the corresponding cluster will have been pre-constructed and can be obtained with getMPIcluster. It is also possible to obtain a reference to the running cluster using makeCluster or makeMPIcluster. In this case the count argument can be omitted; if it is supplied, it must equal the number of nodes in the cluster. This interface is still experimental and subject to change.

## end quotation

So my hope was that either getMPIcluster() or makeCluster() without the count argument would do the job. Unfortunately, neither seems to work, but maybe this is something you have encountered before.
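
For reference, here is minimal3.R (reconstructed from the commands echoed in the log below; note that the error messages appear above the calls that produced them, apparently because stdout and stderr are flushed separately), followed by the full output:

library(doSNOW)   # loads foreach, snow, and other packages
library(Rmpi)     # for the default in makeCluster()
library(rlecuyer) # for clusterSetupRNG()

cl <- getMPIcluster() # obtain the pre-constructed cluster (see ?makeCluster)
clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform RNG streams (L'Ecuyer)
registerDoSNOW(cl) # register the cluster object with foreach
x <- foreach(i = 1:3) %dopar% { # simple test
    sqrt(i)
}
x
stopCluster(cl) # properly shut down the cluster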

## ==== begin snippet "getMPIcluster" ====

Sender: LSF System <lsfadmin at a6269>
Subject: Job 56574: <mpirun R --no-save -q -f minimal3.R> Done

Job <mpirun R --no-save -q -f minimal3.R> was submitted from host <brutus2> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6269>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Sun Dec 19 01:53:54 2010
Results reported at Sun Dec 19 01:54:02 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun R --no-save -q -f minimal3.R
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      6.27 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

master (rank 0, comm 1) of size 4 is running on: a6269 
slave1 (rank 1, comm 1) of size 4 is running on: a6269 
slave2 (rank 2, comm 1) of size 4 is running on: a6269 
slave3 (rank 3, comm 1) of size 4 is running on: a6269 
> library(doSNOW) # loads foreach, snow, and other packages
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
> library(Rmpi) # for default in makeCluster()
> library(rlecuyer) # for clusterSetupRNG
> Error in checkCluster(cl) : not a valid cluster
Calls: clusterSetupRNG ... clusterSetupRNGstream -> clusterApply -> staticClusterApply -> checkCluster
Error in checkCluster(cl) : not a valid cluster
Calls: %dopar% -> <Anonymous> -> clusterCall -> checkCluster
Error: object 'x' not found
cl <- getMPIcluster() # create cluster object with the given number of slaves 
> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
> registerDoSNOW(cl) # register the cluster object with foreach
> x <- foreach(i = 1:3) %dopar% { # simple test
+    sqrt(i)
+ }
> x 
> stopCluster(cl) # properly shut down the cluster 
> 
[1] "Please use mpi.close.Rslaves() to close slaves"
[1] "Please use mpi.quit() to quit R"

## ==== end snippet "getMPIcluster" ====

and

## ==== begin snippet "makeCluster()" ====

Sender: LSF System <lsfadmin at a6231>
Subject: Job 56564: <mpirun R --no-save -q -f minimal2.R> Exited

Job <mpirun R --no-save -q -f minimal2.R> was submitted from host <brutus2> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Sun Dec 19 01:52:48 2010
Results reported at Sun Dec 19 02:52:51 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun R --no-save -q -f minimal2.R
------------------------------------------------------------

TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Exited with exit code 1.

Resource usage summary:

    CPU time   :  13877.23 sec.
    Max Memory :       207 MB
    Max Swap   :      1472 MB

    Max Processes  :         7
    Max Threads    :         8

The output (if any) follows:

master (rank 0, comm 1) of size 4 is running on: a6231 
slave1 (rank 1, comm 1) of size 4 is running on: a6231 
slave2 (rank 2, comm 1) of size 4 is running on: a6231 
slave3 (rank 3, comm 1) of size 4 is running on: a6231 
> library(doSNOW) # loads foreach, snow, and other packages
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
> library(Rmpi) # for default in makeCluster()
> library(rlecuyer) # for clusterSetupRNG
> cl <- makeCluster() # create cluster object with the given number of slaves 
> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
mpirun: Forwarding signal 12 to job
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 30107 on
node a6231.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== end snippet "makeCluster()" ====
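
As the echoed commands show, minimal2.R is the same script apart from the line that obtains the cluster object; the job then apparently sits in clusterSetupRNG() until LSF's one-hour run-time limit kills it:

cl <- makeCluster()                  # count omitted, as suggested by ?makeCluster
clusterSetupRNG(cl, seed = rep(1,6)) # the job seems to hang here until TERM_RUNLIMIT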

Cheers,

Marius
