[R-sig-hpc] R <--> TM <--> Snow <--> Rmpi <--> OpenMPI cluster cleanup

Dirk Eddelbuettel edd at debian.org
Wed Aug 26 04:16:39 CEST 2009


On 25 August 2009 at 20:50, Mark Mueller wrote:
| Host environment:
| - AMD_64, 4xCPU, quad core
| - Ubuntu 9.04 64-bit
| - OpenMPI 1.3.2 (to avoid the problem in v1.3 where OpenMPI tries to connect
| to the localhost via ssh to run local jobs) - manually downloaded source and
| compiled
| - Rmpi 0.5-7
| - TM 0.4
| - Snow 0.3-3
| - R 2.9.0
| When executing the following command on the host:
| $ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R
| the following results, yet the <some program>.R completes successfully:
| "mpirun has exited due to process rank 0 with PID [some pid] on node
| [node name here] exiting without calling "finalize". This may have
| caused other processes in the application to be terminated by signals
| sent by mpirun (as reported here)."

As I recall, something changed between OpenMPI 1.2.* and 1.3.*: mpirun now
expects jobs to end with a call to mpi.quit().  Witness this quick example:

   edd@ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
   Hello, rank 1 size 2 on ron
   mpirun has exited due to process rank 1 with PID 19867 on
   node ron exiting without calling "finalize". This may
   have caused other processes in the application to be
   terminated by signals sent by mpirun (as reported here).
   Hello, rank 0 size 2 on ron
But if I put an mpi.quit() as the last instruction, all is well (the two
ranks' output happens to interleave on stdout, hence the jumbled lines):

   edd@ron:/tmp$ echo "mpi.quit()" >> mpiHelloWorld.r
   edd@ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
   Hello, rank 0 size 2 Hello, rankon ron
    1 size 2 on ron
   edd@ron:/tmp$
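
The script itself can be as small as this -- a sketch of what an
mpiHelloWorld.r might contain (the shebang uses littler's r, so mpirun can
execute the file directly; an Rscript shebang works similarly):

   #!/usr/bin/env r
   ## Minimal Rmpi hello world; loading Rmpi initialises MPI
   library(Rmpi)
   cat("Hello, rank", mpi.comm.rank(0),       # rank within MPI_COMM_WORLD
       "size", mpi.comm.size(0),              # total number of processes
       "on", mpi.get.processor.name(), "\n")  # host name
   mpi.quit()    # finalise MPI and quit R; omit it and mpirun complains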

As an aside, you may like using littler (sudo apt-get install littler) or
Rscript for your scripts instead of the old-school R CMD BATCH.
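
With Rscript, for instance, the same run becomes (same placeholders as in
your command above):

   $ mpirun --hostfile <some file> -np 1 Rscript <some program>.R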

| - The hostfile does not create a situation where the system is
| oversubscribed.  In this case, slots=4 and max-slots=5.
| - The <some program>.R uses snow::activateCluster() and
| snow::deactivateCluster() in the appropriate places.  There are no
| other code elements that control MPI in the <some program>.R file.
| I am suspicious that since the R + TM program completes successfully,
| there is something in the Rmpi/Snow/OpenMPI layer that is not cleaning
| up the MPI environment properly.  This is problematic because any

Good diagnosis -- you almost got to mpi.quit()!

As an aside, I really like running simple helloWorld.r programs just to
ensure that the setup is right. Small and simple, easier to analyse.

| shell scripts that issue the mpirun directive will capture an exit
| status of 1 (i.e. an "error") from the mpirun command, yet there does
| not seem to be anything present in the environment that would cause
| mpirun (OpenMPI) to encounter an error condition.  This "clouds" the
| successful exit status from the R CMD BATCH command.
| Are there any known aspects of these packages that have not fully
| implemented a complete cleanup routine for MPI implementations using
| OpenMPI?

I can't tell whether tm needs that or whether your calling script needs it --
but try adding the mpi.quit() call at the very end and see if that helps.
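
In a snow-over-Rmpi script, the tail end would then look something like
this (a sketch assuming a plain snow MPI cluster; tm's own wrappers may
differ):

   library(Rmpi)
   library(snow)
   cl <- makeCluster(4, type = "MPI")   # spawn 4 workers via Rmpi
   ## ... parallel work, e.g. clusterApply(cl, 1:4, function(i) i^2) ...
   stopCluster(cl)                      # shut the workers down first
   mpi.quit()                           # finalise MPI; mpirun then exits 0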

Cheers, Dirk

| Any insight or assistance will be greatly appreciated.
| Sincerely,
| Mark
| _______________________________________________
| R-sig-hpc mailing list
| R-sig-hpc at r-project.org
| https://stat.ethz.ch/mailman/listinfo/r-sig-hpc

Three out of two people have difficulties with fractions.
