[R-sig-hpc] doSNOW + foreach = embarrassingly frustrating computation

Martin Morgan mtmorgan at fhcrc.org
Tue Dec 21 20:40:40 CET 2010


On 12/21/2010 10:59 AM, Marius Hofert wrote:
> Hi all,
> 
> Martin Morgan responded off-list and pointed out that I might have used the wrong bsub-command. He suggested:
> bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --vanilla -f minimal.R 
> Since my installed packages were not found (due to --no-environ as part of --vanilla), I used:

I would check that your, or the site's, R environment file is not doing
anything unusual; I'm surprised that you need it at all. More below...
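
(Often a user ~/.Renviron does little more than point R at a personal
library, e.g. something like

  R_LIBS_USER=~/Rlibs

though that path is only a guess at what yours sets; if that is all it
does, dropping --vanilla, as you did, is a reasonable workaround.)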

> bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --no-save -q -f minimal.R 
> Below, please find all the outputs [run under the same setup as before], with comments.
> It seems like (2) and (6) almost solve the problem. But what does this "finalize" mean?
> 
> Cheers,
> 
> Marius
> 
> 
> (1) First trial (check if MPI runs):
> 
> minimal example as given on http://math.acadiau.ca/ACMMaC/Rmpi/sample.html 
> 
> ## ==== output (1) start ====
> 
> Sender: LSF System <lsfadmin at a6231>
> Subject: Job 192910: <mpirun -n 1 R --no-save -q -f m01.R> Done
> 
> Job <mpirun -n 1 R --no-save -q -f m01.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:44:03 2010
> Results reported at Tue Dec 21 19:44:19 2010
> 
> Your job looked like:
> 
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m01.R
> ------------------------------------------------------------
> 
> Successfully completed.
> 
> Resource usage summary:
> 
>     CPU time   :     24.90 sec.
>     Max Memory :         3 MB
>     Max Swap   :        29 MB
> 
>     Max Processes  :         1
>     Max Threads    :         1
> 
> The output (if any) follows:
> 
>> ## from http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
>>
>> # Load the R MPI package if it is not already loaded.
>> if (!is.loaded("mpi_initialize")) {
> +     library("Rmpi")
> +     }
>>                                                                                 
>> # Spawn as many slaves as possible
>> mpi.spawn.Rslaves()
> 	4 slaves are spawned successfully. 0 failed.
> master (rank 0, comm 1) of size 5 is running on: a6231 
> slave1 (rank 1, comm 1) of size 5 is running on: a6231 
> slave2 (rank 2, comm 1) of size 5 is running on: a6231 
> slave3 (rank 3, comm 1) of size 5 is running on: a6231 
> slave4 (rank 4, comm 1) of size 5 is running on: a6231 
>>                                                                                 
>> # In case R exits unexpectedly, have it automatically clean up
>> # resources taken up by Rmpi (slaves, memory, etc...)
>> .Last <- function(){
> +     if (is.loaded("mpi_initialize")){
> +         if (mpi.comm.size(1) > 0){
> +             print("Please use mpi.close.Rslaves() to close slaves.")
> +             mpi.close.Rslaves()
> +         }
> +         print("Please use mpi.quit() to quit R")
> +         .Call("mpi_finalize")
> +     }
> + }

This part of the 'minimal' example doesn't seem minimal; I'd remove it,
but follow its advice and conclude your scripts with

  mpi.close.Rslaves()
  mpi.quit()
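
i.e. the whole of m01.R could then be about as short as this (an untested
sketch, reusing the calls from your transcript):

  library("Rmpi")
  mpi.spawn.Rslaves()       # spawn as many slaves as possible
  mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))
  mpi.close.Rslaves()       # close the slaves ...
  mpi.quit()                # ... and finalize MPI on the way out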

>>
>> # Tell all slaves to return a message identifying themselves
>> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
> $slave1
> [1] "I am 1 of 5"
> 
> $slave2
> [1] "I am 2 of 5"
> 
> $slave3
> [1] "I am 3 of 5"
> 
> $slave4
> [1] "I am 4 of 5"
> 
>>
>> # Tell all slaves to close down, and exit the program
>> mpi.close.Rslaves()
> --------------------------------------------------------------------------
> An MPI process has executed an operation involving a call to the
> "fork()" system call to create a child process.  Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your MPI job may hang, crash, or produce silent
> data corruption.  The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.  
> 
> The process that invoked fork was:
> 
>   Local host:          a6231.hpc-net.ethz.ch (PID 8966)
>   MPI_COMM_WORLD rank: 0
> 
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --------------------------------------------------------------------------
> [1] 1
>> mpi.quit()
> 
> ## ==== output (1) end ====
> 
> => now there is no error anymore (only the warning (?))
> 
> (2) Second trial 
> 
> ## ==== output (2) start ====
> 
> Sender: LSF System <lsfadmin at a6231>
> Subject: Job 193052: <mpirun -n 1 R --no-save -q -f m02.R> Exited
> 
> Job <mpirun -n 1 R --no-save -q -f m02.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:39 2010
> 
> Your job looked like:
> 
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m02.R
> ------------------------------------------------------------
> 
> Exited with exit code 1.
> 
> Resource usage summary:
> 
>     CPU time   :      7.20 sec.
>     Max Memory :         3 MB
>     Max Swap   :        29 MB
> 
>     Max Processes  :         1
>     Max Threads    :         1
> 
> The output (if any) follows:
> 
>> library(doSNOW) 
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- makeCluster(3, type = "MPI") # create cluster object with the given number of slaves 
> 	3 slaves are spawned successfully. 0 failed.
>> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
> [1] "RNGstream"
>> registerDoSNOW(cl) # register the cluster object with foreach
>> ## start the work
>> x <- foreach(i = 1:3) %dopar% { 
> +    sqrt(i)
> + }
>> x 
> [[1]]
> [1] 1
> 
> [[2]]
> [1] 1.414214
> 
> [[3]]
> [1] 1.732051
> 
>> stopCluster(cl) # properly shut down the cluster 
> [1] 1
>>
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9048 on
> node a6231.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------

here I think you are being told to end your script with

  mpi.quit()
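
i.e. the end of m02.R becomes (a sketch)

  stopCluster(cl)   # shut the snow / MPI slaves down
  mpi.quit()        # let Rmpi call MPI's finalize before R exits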


> 
> ## ==== output (2) end ====
> 
> => okay, a first glimmer of hope: the calculations were done. But why the "exit code 1" / finalize problem?
> 
> (3) Third trial 
> 
> ## ==== output (3) start ====
> 
> Sender: LSF System <lsfadmin at a6204>
> Subject: Job 193053: <mpirun -n 1 R --no-save -q -f m03.R> Exited
> 
> Job <mpirun -n 1 R --no-save -q -f m03.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6204>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:36 2010
> 
> Your job looked like:
> 
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m03.R
> ------------------------------------------------------------
> 
> Exited with exit code 1.
> 
> Resource usage summary:
> 
>     CPU time   :      0.93 sec.
>     Max Memory :         3 MB
>     Max Swap   :        29 MB
> 
>     Max Processes  :         1
>     Max Threads    :         1
> 
> The output (if any) follows:
> 
>> library(doSNOW) 
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- makeCluster() # create cluster object 
> Error in makeMPIcluster(spec, ...) : no nodes available.
> Calls: makeCluster -> makeMPIcluster
> Execution halted

here snow is determining the size of the cluster with mpi.comm.size()
(which returns 0) whereas I think you want to do something like

   n = mpi.universe.size()
   cl = makeCluster(n, type="MPI")

likewise below. In some cases mpi.universe.size() uses a system call to
'lamnodes', which will fail on systems without a lamnodes command; the
cheap workaround is to create an executable file called lamnodes that
does nothing and is on your PATH.
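
Trial (3), for instance, would then become something along these lines (an
untested sketch that just combines the pieces above):

  library(doSNOW)
  library(Rmpi)
  library(rlecuyer)

  n <- mpi.universe.size()             # slots provided by mpirun / LSF
  cl <- makeCluster(n, type = "MPI")   # rather than makeCluster() with no arguments
  clusterSetupRNG(cl, seed = rep(1, 6))
  registerDoSNOW(cl)
  x <- foreach(i = 1:3) %dopar% sqrt(i)
  stopCluster(cl)
  mpi.quit()                           # so mpirun sees a proper finalize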

Martin

> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9530 on
> node a6204.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------

> 
> ## ==== output (3) end ====
> 
> (4) Fourth trial 
> 
> ## ==== output (4) start ====
> 
> Sender: LSF System <lsfadmin at a6278>
> Subject: Job 193056: <mpirun -n 1 R --no-save -q -f m04.R> Exited
> 
> Job <mpirun -n 1 R --no-save -q -f m04.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6278>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:37 2010
> 
> Your job looked like:
> 
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m04.R
> ------------------------------------------------------------
> 
> Exited with exit code 1.
> 
> Resource usage summary:
> 
>     CPU time   :      1.01 sec.
>     Max Memory :         3 MB
>     Max Swap   :        29 MB
> 
>     Max Processes  :         1
>     Max Threads    :         1
> 
> The output (if any) follows:
> 
>> library(doSNOW) 
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- makeMPIcluster() # create cluster object  
> Error in makeMPIcluster() : no nodes available.
> Execution halted
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9778 on
> node a6278.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> 
> ## ==== output (4) end ====
> 
> => now (3) and (4) run and stop, but with errors.
> 
> (5) Fifth trial 
> 
> ## ==== output (5) start ====
> 
> Sender: LSF System <lsfadmin at a6244>
> Subject: Job 193057: <mpirun -n 1 R --no-save -q -f m05.R> Exited
> 
> Job <mpirun -n 1 R --no-save -q -f m05.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6244>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:37 2010
> 
> Your job looked like:
> 
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m05.R
> ------------------------------------------------------------
> 
> Exited with exit code 1.
> 
> Resource usage summary:
> 
>     CPU time   :      0.98 sec.
>     Max Memory :         4 MB
>     Max Swap   :        29 MB
> 
>     Max Processes  :         1
>     Max Threads    :         1
> 
> The output (if any) follows:
> 
>> library(doSNOW) 
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- getMPIcluster() # get the MPI cluster
>> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
> Error in checkCluster(cl) : not a valid cluster
> Calls: clusterSetupRNG ... clusterSetupRNGstream -> clusterApply -> staticClusterApply -> checkCluster
> Execution halted
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9571 on
> node a6244.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> 
> ## ==== output (5) end ====
> 
> (6) Sixth trial
> 
> ## ==== output (6) start ====
> 
> Sender: LSF System <lsfadmin at a6266>
> Subject: Job 193058: <mpirun -n 1 R --no-save -q -f m06.R> Exited
> 
> Job <mpirun -n 1 R --no-save -q -f m06.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
> Job was executed on host(s) <4*a6266>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
> </cluster/home/math/hofertj> was used as the home directory.
> </cluster/home/math/hofertj> was used as the working directory.
> Started at Tue Dec 21 19:49:28 2010
> Results reported at Tue Dec 21 19:49:41 2010
> 
> Your job looked like:
> 
> ------------------------------------------------------------
> # LSBATCH: User input
> mpirun -n 1 R --no-save -q -f m06.R
> ------------------------------------------------------------
> 
> Exited with exit code 1.
> 
> Resource usage summary:
> 
>     CPU time   :      3.69 sec.
>     Max Memory :         4 MB
>     Max Swap   :        29 MB
> 
>     Max Processes  :         1
>     Max Threads    :         1
> 
> The output (if any) follows:
> 
>> library(doSNOW) 
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: codetools
> Loading required package: snow
>> library(Rmpi)
>> library(rlecuyer)
>>
>> cl <- makeMPIcluster(3) # create cluster object with the given number of slaves 
> 	3 slaves are spawned successfully. 0 failed.
>> clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
> [1] "RNGstream"
>> registerDoSNOW(cl) # register the cluster object with foreach
>> ## start the work
>> x <- foreach(i = 1:3) %dopar% { 
> +    sqrt(i)
> + }
>> x 
> [[1]]
> [1] 1
> 
> [[2]]
> [1] 1.414214
> 
> [[3]]
> [1] 1.732051
> 
>> stopCluster(cl) # properly shut down the cluster
> [1] 1
>>
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 24975 on
> node a6266.hpc-net.ethz.ch exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> 
> ## ==== output (6) end ====
> 
> => similar to (2)


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


