[R-sig-hpc] R <--> TM <--> Snow <--> Rmpi <--> OpenMPI cluster cleanup

Dirk Eddelbuettel edd at debian.org
Wed Aug 26 04:16:39 CEST 2009


Mark,

On 25 August 2009 at 20:50, Mark Mueller wrote:
| PROBLEM DEFINITION --
| 
| Host environment:
| 
| - AMD_64, 4xCPU, quad core
| - Ubuntu 9.04 64-bit
| - OpenMPI 1.3.2 (to avoid the problem in v1.3 where OpenMPI tries to connect
| to the localhost via ssh to run local jobs) - manually downloaded source and
| compiled
| - Rmpi 0.5-7
| - TM 0.4
| - Snow 0.3-3
| - R 2.9.0
| 
| When executing the following command on the host:
| 
| $ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R
| 
| the following results, yet the <some program>.R completes successfully:
| 
| "mpirun has exited due to process rank 0 with PID [some pid] on node
| [node name here] exiting without calling "finalize". This may have
| caused other processes in the application to be terminated by signals
| sent by mpirun (as reported here)."

As I recall, something changed between OpenMPI 1.2.* and 1.3.* so that it now
expects jobs to end with a call to mpi.quit().  Witness this quick example:

   edd@ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
   Hello, rank 1 size 2 on ron
   --------------------------------------------------------------------------
   mpirun has exited due to process rank 1 with PID 19867 on
   node ron exiting without calling "finalize". This may
   have caused other processes in the application to be
   terminated by signals sent by mpirun (as reported here).
   --------------------------------------------------------------------------
   Hello, rank 0 size 2 on ron
   
But if I put an mpi.quit() as the last instruction in there, all is well:
   
   edd@ron:/tmp$ echo "mpi.quit()" >> mpiHelloWorld.r
   edd@ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
   Hello, rank 0 size 2 Hello, rankon ron
    1 size 2 on ron
   edd@ron:/tmp$

As an aside, you may like using littler (sudo apt-get install littler) or
Rscript for your scripts instead of the old-school R CMD BATCH.
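
Such a helloWorld script is essentially just the following (a minimal sketch,
here with an Rscript shebang; a littler shebang works just as well):

   #!/usr/bin/env Rscript
   ## each rank reports its rank, the communicator size and the node name
   library(Rmpi)
   cat("Hello, rank", mpi.comm.rank(0), "size", mpi.comm.size(0),
       "on", Sys.info()[["nodename"]], "\n")
   ## let Rmpi call MPI_Finalize so mpirun sees a clean exit
   mpi.quit()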

| CONFIGURATION STEPS TAKEN --
| 
| - The hostfile does not create a situation where the system is
| oversubscribed.  In this case, slots=4 and max-slots=5.
| 
| - The <some program>.R uses snow::activateCluster() and
| snow::deactivateCluster() in the appropriate places.  There are no
| other code elements that control MPI in the <some program>.R file.
| 
| I am suspicious that since the R + TM program completes successfully,
| there is something in the Rmpi/Snow/OpenMPI layer that is not cleaning
| up the MPI environment properly.  This is problematic because any

Good diagnosis -- you almost got to mpi.quit()!

As an aside, I really like running simple helloWorld.r programs just to
ensure that the setup is right. Small and simple, easier to analyse.

| shell scripts that issue the mpirun directive will capture an exit
| status of 1 (i.e. an "error") from the mpirun command, yet there does
| not seem to be anything present in the environment that would cause
| mpirun (OpenMPI) to encounter an error condition.  This "clouds" the
| successful exit status from the R CMD BATCH command.
| 
| Are there any known aspects of these packages that have not fully
| implemented a complete cleanup routine for MPI implementations using
| OpenMPI?

I can't tell whether tm needs that or whether your calling script needs it --
but try adding the mpi.quit() and see if that helps.
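
Roughly, the end of the script should look something like this (a sketch
only -- I am assuming a plain snow cluster below; your tm-based setup may
create and tear down the cluster differently):

   library(Rmpi)
   library(snow)
   cl <- makeCluster(4, type = "MPI")   # however your script sets up the cluster
   ## ... the tm / snow work goes here ...
   stopCluster(cl)    # shut down the snow workers first
   mpi.quit()         # then have Rmpi finalize MPI so mpirun exits with status 0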

Cheers, Dirk

| Any insight or assistance will be greatly appreciated.
| 
| Sincerely,
| Mark
| 
| _______________________________________________
| R-sig-hpc mailing list
| R-sig-hpc at r-project.org
| https://stat.ethz.ch/mailman/listinfo/r-sig-hpc

-- 
Three out of two people have difficulties with fractions.


