[R-sig-hpc] doSNOW + foreach = embarrassingly frustrating computation
Martin Morgan
mtmorgan at fhcrc.org
Tue Dec 21 20:40:40 CET 2010
On 12/21/2010 10:59 AM, Marius Hofert wrote:
> Hi all,
>
> Martin Morgan responded off-list and pointed out that I might have used the wrong bsub-command. He suggested:
> bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --vanilla -f minimal.R
> Since my installed packages were not found (due to --no-environ as part of --vanilla), I used:
I would confirm that your (or the site's) R environment file is not doing
anything unusual; I'm surprised that you need it set at all. More below...
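If the issue is just that --no-environ drops R_LIBS, a cheap (untested) workaround
is to set the library path at the top of the script itself, so the environment file
is not needed at all; "~/R/library" below is only a placeholder for wherever your
packages actually live:

.libPaths(c("~/R/library", .libPaths()))  # prepend the personal library; placeholder path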
> bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --no-save -q -f minimal.R
> Below, please find all the outputs [ran under the same setup as before], with comments.
> It seems like (2) and (6) almost solve the problem. But what does this "finalize" mean?
>
> Cheers,
>
> Marius
>
>
> (1) First trial (check if MPI runs):
>
> minimal example as given on http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
>
> ## ==== output (1) start ====
>
> Sender: LSF System <lsfadmin at a6231>
> Subject: Job 192910: <mpirun -n 1 R --no-save -q -f m01.R> Done
>
> Job <mpirun -n 1 R --no-save -q -f m01.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:44:03 2010
> Results reported at Tue Dec 21 19:44:19 2010
>
> Your job looked like:
>
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m01.R
> ------------------------------------------------------------
>
> Successfully completed.
>
> Resource usage summary:
>
> CPU time : 24.90 sec.
> Max Memory : 3 MB
> Max Swap : 29 MB
>
> Max Processes : 1
> Max Threads : 1
>
> The output (if any) follows:
>
>> ## from http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
>>
>> # Load the R MPI package if it is not already loaded.
>> if (!is.loaded("mpi_initialize")) {
> + library("Rmpi")
> + }
>>
>> # Spawn as many slaves as possible
>> mpi.spawn.Rslaves()
> 4 slaves are spawned successfully. 0 failed.
> master (rank 0, comm 1) of size 5 is running on: a6231
> slave1 (rank 1, comm 1) of size 5 is running on: a6231
> slave2 (rank 2, comm 1) of size 5 is running on: a6231
> slave3 (rank 3, comm 1) of size 5 is running on: a6231
> slave4 (rank 4, comm 1) of size 5 is running on: a6231
>>
>> # In case R exits unexpectedly, have it automatically clean up
>> # resources taken up by Rmpi (slaves, memory, etc...)
>> .Last <- function(){
> + if (is.loaded("mpi_initialize")){
> + if (mpi.comm.size(1) > 0){
> + print("Please use mpi.close.Rslaves() to close slaves.")
> + mpi.close.Rslaves()
> + }
> + print("Please use mpi.quit() to quit R")
> + .Call("mpi_finalize")
> + }
> + }
This part of the 'minimal' example doesn't seem minimal; I'd remove it,
but follow its advice and conclude your scripts with
mpi.close.Rslaves()
mpi.quit()
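A trimmed version of the example, using only what is already in your script
(an untested sketch), would then be

library("Rmpi")
mpi.spawn.Rslaves()            # one slave per slot that mpirun makes available
mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))
mpi.close.Rslaves()            # shut the slaves down cleanly
mpi.quit()                     # calls MPI finalize and exits R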
>>
>> # Tell all slaves to return a message identifying themselves
>> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
> $slave1
> [1] "I am 1 of 5"
>
> $slave2
> [1] "I am 2 of 5"
>
> $slave3
> [1] "I am 3 of 5"
>
> $slave4
> [1] "I am 4 of 5"
>
>>
>> # Tell all slaves to close down, and exit the program
>> mpi.close.Rslaves()
> --------------------------------------------------------------------------
> An MPI process has executed an operation involving a call to the
> "fork()" system call to create a child process. Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your MPI job may hang, crash, or produce silent
> data corruption. The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
>
> The process that invoked fork was:
>
> Local host: a6231.hpc-net.ethz.ch (PID 8966)
> MPI_COMM_WORLD rank: 0
>
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --------------------------------------------------------------------------
> [1] 1
>> mpi.quit()
>
> ## ==== output (1) end ====
>
> => now there is no error anymore (only the fork() warning)
>
> (2) Second trial
>
> ## ==== output (2) start ====
>
> Sender: LSF System <lsfadmin at a6231>
> Subject: Job 193052: <mpirun -n 1 R --no-save -q -f m02.R> Exited
>
> Job <mpirun -n 1 R --no-save -q -f m02.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:39 2010
>
> Your job looked like:
>
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m02.R
> ------------------------------------------------------------
>
> Exited with exit code 1.
>
> Resource usage summary:
>
> CPU time : 7.20 sec.
> Max Memory : 3 MB
> Max Swap : 29 MB
>
> Max Processes : 1
> Max Threads : 1
>
> The output (if any) follows:
>
>> library(doSNOW)
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- makeCluster(3, type = "MPI") # create cluster object with the given number of slaves
> 3 slaves are spawned successfully. 0 failed.
>> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
> [1] "RNGstream"
>> registerDoSNOW(cl) # register the cluster object with foreach
>> ## start the work
>> x <- foreach(i = 1:3) %dopar% {
> + sqrt(i)
> + }
>> x
> [[1]]
> [1] 1
>
> [[2]]
> [1] 1.414214
>
> [[3]]
> [1] 1.732051
>
>> stopCluster(cl) # properly shut down the cluster
> [1] 1
>>
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9048 on
> node a6231.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
here I think you are being told to end your script with
mpi.quit()
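i.e. the end of m02.R would look something like this (untested):

stopCluster(cl)   # shut down the snow workers
mpi.quit()        # calls MPI finalize before exiting R, so mpirun is satisfied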
>
> ## ==== output (2) end ====
>
> => okay, a first glimmer of hope: the calculations were done. But why the "exit code 1"/"finalize" problem?
>
> (3) Third trial
>
> ## ==== output (3) start ====
>
> Sender: LSF System <lsfadmin at a6204>
> Subject: Job 193053: <mpirun -n 1 R --no-save -q -f m03.R> Exited
>
> Job <mpirun -n 1 R --no-save -q -f m03.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6204>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:36 2010
>
> Your job looked like:
>
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m03.R
> ------------------------------------------------------------
>
> Exited with exit code 1.
>
> Resource usage summary:
>
> CPU time : 0.93 sec.
> Max Memory : 3 MB
> Max Swap : 29 MB
>
> Max Processes : 1
> Max Threads : 1
>
> The output (if any) follows:
>
>> library(doSNOW)
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- makeCluster() # create cluster object
> Error in makeMPIcluster(spec, ...) : no nodes available.
> Calls: makeCluster -> makeMPIcluster
> Execution halted
here snow is determining the size of the cluster with mpi.comm.size()
(which returns 0) whereas I think you want to do something like
n = mpi.universe.size()
cl = makeCluster(n, type="MPI")
likewise below. In some cases mpi.universe.size() uses a system call to
'lamnodes', which will fail on systems without a lamnodes command; the
cheap workaround is to create an executable file called lamnodes that
does nothing and is on your PATH.
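Putting the two changes together (size the cluster from mpi.universe.size(),
finish with mpi.quit()), an untested sketch of the script from (2) would be

library(doSNOW)
library(Rmpi)
library(rlecuyer)
n <- mpi.universe.size()              # slots made available by bsub -n / mpirun
cl <- makeCluster(n, type = "MPI")
clusterSetupRNG(cl, seed = rep(1, 6)) # L'Ecuyer streams, as in your script
registerDoSNOW(cl)
x <- foreach(i = 1:3) %dopar% sqrt(i)
stopCluster(cl)
mpi.quit()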
Martin
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9530 on
> node a6204.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> ## ==== output (3) end ====
>
> (4) Fourth trial
>
> ## ==== output (4) start ====
>
> Sender: LSF System <lsfadmin at a6278>
> Subject: Job 193056: <mpirun -n 1 R --no-save -q -f m04.R> Exited
>
> Job <mpirun -n 1 R --no-save -q -f m04.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6278>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:37 2010
>
> Your job looked like:
>
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m04.R
> ------------------------------------------------------------
>
> Exited with exit code 1.
>
> Resource usage summary:
>
> CPU time : 1.01 sec.
> Max Memory : 3 MB
> Max Swap : 29 MB
>
> Max Processes : 1
> Max Threads : 1
>
> The output (if any) follows:
>
>> library(doSNOW)
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- makeMPIcluster() # create cluster object
> Error in makeMPIcluster() : no nodes available.
> Execution halted
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9778 on
> node a6278.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> ## ==== output (4) end ====
>
> => now (3) and (4) run and stop, but with errors.
>
> (5) Fifth trial
>
> ## ==== output (5) start ====
>
> Sender: LSF System <lsfadmin at a6244>
> Subject: Job 193057: <mpirun -n 1 R --no-save -q -f m05.R> Exited
>
> Job <mpirun -n 1 R --no-save -q -f m05.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6244>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:37 2010
>
> Your job looked like:
>
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m05.R
> ------------------------------------------------------------
>
> Exited with exit code 1.
>
> Resource usage summary:
>
> CPU time : 0.98 sec.
> Max Memory : 4 MB
> Max Swap : 29 MB
>
> Max Processes : 1
> Max Threads : 1
>
> The output (if any) follows:
>
>> library(doSNOW)
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- getMPIcluster() # get the MPI cluster
>> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
> Error in checkCluster(cl) : not a valid cluster
> Calls: clusterSetupRNG ... clusterSetupRNGstream -> clusterApply -> staticClusterApply -> checkCluster
> Execution halted
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9571 on
> node a6244.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> ## ==== output (5) end ====
>
> (6) Sixth trial
>
> ## ==== output (6) start ====
>
> Sender: LSF System <lsfadmin at a6266>
> Subject: Job 193058: <mpirun -n 1 R --no-save -q -f m06.R> Exited
>
> Job <mpirun -n 1 R --no-save -q -f m06.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6266>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:41 2010
>
> Your job looked like:
>
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m06.R
> ------------------------------------------------------------
>
> Exited with exit code 1.
>
> Resource usage summary:
>
> CPU time : 3.69 sec.
> Max Memory : 4 MB
> Max Swap : 29 MB
>
> Max Processes : 1
> Max Threads : 1
>
> The output (if any) follows:
>
>> library(doSNOW)
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- makeMPIcluster(3) # create cluster object with the given number of slaves
> 3 slaves are spawned successfully. 0 failed.
>> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
> [1] "RNGstream"
>> registerDoSNOW(cl) # register the cluster object with foreach
>> ## start the work
>> x <- foreach(i = 1:3) %dopar% {
> + sqrt(i)
> + }
>> x
> [[1]]
> [1] 1
>
> [[2]]
> [1] 1.414214
>
> [[3]]
> [1] 1.732051
>
>> stopCluster(cl) # properly shut down the cluster
> [1] 1
>>
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 24975 on
> node a6266.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> ## ==== output (6) end ====
>
> => similar to (2)
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793