[R-sig-hpc] doSNOW + foreach = embarrassingly frustrating computation

Marius Hofert m_hofert at web.de
Tue Dec 21 19:59:17 CET 2010


Hi all,

Martin Morgan responded off-list and pointed out that I might have used the wrong bsub command. He suggested:
bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --vanilla -f minimal.R 
Since my installed packages were not found (due to --no-environ as part of --vanilla), I used:
bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --no-save -q -f minimal.R 
Below, please find all the outputs (run under the same setup as before), with comments.
It seems that (2) and (6) almost solve the problem. But what does this "finalize" message mean?
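
For reference, the doSNOW/foreach test script behind (2) below, reconstructed from the transcript, is essentially the following; (6) is identical except that it calls makeMPIcluster(3) directly:

library(doSNOW) # also loads foreach, iterators and snow
library(Rmpi)
library(rlecuyer)

cl <- makeCluster(3, type = "MPI") # spawn 3 slaves via Rmpi
clusterSetupRNG(cl, seed = rep(1,6)) # L'Ecuyer RNG streams on the slaves
registerDoSNOW(cl) # register the cluster with foreach
x <- foreach(i = 1:3) %dopar% sqrt(i) # the embarrassingly parallel "work"
x
stopCluster(cl) # properly shut down the cluster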

Cheers,

Marius


(1) First trial (check if MPI runs):

minimal example as given on http://math.acadiau.ca/ACMMaC/Rmpi/sample.html 

## ==== output (1) start ====

Sender: LSF System <lsfadmin at a6231>
Subject: Job 192910: <mpirun -n 1 R --no-save -q -f m01.R> Done

Job <mpirun -n 1 R --no-save -q -f m01.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:44:03 2010
Results reported at Tue Dec 21 19:44:19 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m01.R
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :     24.90 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

> ## from http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
> 
> # Load the R MPI package if it is not already loaded.
> if (!is.loaded("mpi_initialize")) {
+     library("Rmpi")
+     }
>                                                                                 
> # Spawn as many slaves as possible
> mpi.spawn.Rslaves()
	4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: a6231 
slave1 (rank 1, comm 1) of size 5 is running on: a6231 
slave2 (rank 2, comm 1) of size 5 is running on: a6231 
slave3 (rank 3, comm 1) of size 5 is running on: a6231 
slave4 (rank 4, comm 1) of size 5 is running on: a6231 
>                                                                                 
> # In case R exits unexpectedly, have it automatically clean up
> # resources taken up by Rmpi (slaves, memory, etc...)
> .Last <- function(){
+     if (is.loaded("mpi_initialize")){
+         if (mpi.comm.size(1) > 0){
+             print("Please use mpi.close.Rslaves() to close slaves.")
+             mpi.close.Rslaves()
+         }
+         print("Please use mpi.quit() to quit R")
+         .Call("mpi_finalize")
+     }
+ }
> 
> # Tell all slaves to return a message identifying themselves
> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
$slave1
[1] "I am 1 of 5"

$slave2
[1] "I am 2 of 5"

$slave3
[1] "I am 3 of 5"

$slave4
[1] "I am 4 of 5"

> 
> # Tell all slaves to close down, and exit the program
> mpi.close.Rslaves()
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          a6231.hpc-net.ethz.ch (PID 8966)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[1] 1
> mpi.quit()

## ==== output (1) end ====

=> now there is no error anymore (only the fork() warning (?))
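
If the fork() warning is only cosmetic, it can presumably be silenced by setting the MCA parameter the message itself mentions, i.e. something like (untried):
bsub -n 4 -R "select[model==Opteron8380]" mpirun --mca mpi_warn_on_fork 0 -n 1 R --no-save -q -f m01.R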

(2) Second trial 

## ==== output (2) start ====

Sender: LSF System <lsfadmin at a6231>
Subject: Job 193052: <mpirun -n 1 R --no-save -q -f m02.R> Exited

Job <mpirun -n 1 R --no-save -q -f m02.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:39 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m02.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      7.20 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

> library(doSNOW) 
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
> library(Rmpi)
> library(rlecuyer)
> 
> cl <- makeCluster(3, type = "MPI") # create cluster object with the given number of slaves 
	3 slaves are spawned successfully. 0 failed.
> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
[1] "RNGstream"
> registerDoSNOW(cl) # register the cluster object with foreach
> ## start the work
> x <- foreach(i = 1:3) %dopar% { 
+    sqrt(i)
+ }
> x 
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

> stopCluster(cl) # properly shut down the cluster 
[1] 1
> 
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9048 on
node a6231.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (2) end ====

=> okay, a first ray of hope: the calculations were done correctly. But why the exit code 1 and the "finalize" complaint?
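
My guess is that R simply exits at the end of the script without MPI_Finalize ever being called. Perhaps ending the script with mpi.quit() from Rmpi, instead of letting R terminate on its own, would avoid this; a minimal sketch of what I would try next (untested):

## ... as in (2) above, then:
stopCluster(cl) # shut down the slaves
mpi.quit()      # should call MPI_Finalize and then quit R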

(3) Third trial 

## ==== output (3) start ====

Sender: LSF System <lsfadmin at a6204>
Subject: Job 193053: <mpirun -n 1 R --no-save -q -f m03.R> Exited

Job <mpirun -n 1 R --no-save -q -f m03.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6204>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:36 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m03.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      0.93 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

> library(doSNOW) 
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
> library(Rmpi)
> library(rlecuyer)
> 
> cl <- makeCluster() # create cluster object 
Error in makeMPIcluster(spec, ...) : no nodes available.
Calls: makeCluster -> makeMPIcluster
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9530 on
node a6204.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (3) end ====

(4) Fourth trial 

## ==== output (4) start ====

Sender: LSF System <lsfadmin at a6278>
Subject: Job 193056: <mpirun -n 1 R --no-save -q -f m04.R> Exited

Job <mpirun -n 1 R --no-save -q -f m04.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6278>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:37 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m04.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      1.01 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

> library(doSNOW) 
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
> library(Rmpi)
> library(rlecuyer)
> 
> cl <- makeMPIcluster() # create cluster object  
Error in makeMPIcluster() : no nodes available.
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9778 on
node a6278.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (4) end ====

=> (3) and (4) now run and stop, but with errors ("no nodes available"); presumably makeCluster()/makeMPIcluster() need an explicit number of slaves here, as in (2) and (6).

(5) Fifth trial 

## ==== output (5) start ====

Sender: LSF System <lsfadmin at a6244>
Subject: Job 193057: <mpirun -n 1 R --no-save -q -f m05.R> Exited

Job <mpirun -n 1 R --no-save -q -f m05.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6244>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:37 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m05.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      0.98 sec.
    Max Memory :         4 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

> library(doSNOW) 
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
> library(Rmpi)
> library(rlecuyer)
> 
> cl <- getMPIcluster() # get the MPI cluster
> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
Error in checkCluster(cl) : not a valid cluster
Calls: clusterSetupRNG ... clusterSetupRNGstream -> clusterApply -> staticClusterApply -> checkCluster
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9571 on
node a6244.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (5) end ====
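
(5) presumably fails for a related reason: getMPIcluster() returns NULL when no cluster has been created beforehand, so clusterSetupRNG() rightly complains about an invalid cluster. If I read the snow documentation correctly, getMPIcluster() is meant for the case where the R processes are started via snow's RMPISNOW wrapper (something like mpirun -n 4 RMPISNOW) rather than via mpirun -n 1 R; with the launch method used here one would have to guard it, roughly (untested):

cl <- getMPIcluster()       # NULL here: no cluster was set up by the launcher
if (is.null(cl))
    cl <- makeMPIcluster(3) # so the slaves have to be spawned explicitly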

(6) Sixth trial

## ==== output (6) start ====

Sender: LSF System <lsfadmin at a6266>
Subject: Job 193058: <mpirun -n 1 R --no-save -q -f m06.R> Exited

Job <mpirun -n 1 R --no-save -q -f m06.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6266>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:41 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m06.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      3.69 sec.
    Max Memory :         4 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:

> library(doSNOW) 
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
> library(Rmpi)
> library(rlecuyer)
> 
> cl <- makeMPIcluster(3) # create cluster object with the given number of slaves 
	3 slaves are spawned successfully. 0 failed.
> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
[1] "RNGstream"
> registerDoSNOW(cl) # register the cluster object with foreach
> ## start the work
> x <- foreach(i = 1:3) %dopar% { 
+    sqrt(i)
+ }
> x 
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

> stopCluster(cl) # properly shut down the cluster
[1] 1
> 
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 24975 on
node a6266.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (6) end ====

=> similar to (2): correct results, but again exit code 1 because of the missing "finalize".

