[R-sig-hpc] R <--> TM <--> Snow <--> Rmpi <--> OpenMPI cluster cleanup
Dirk Eddelbuettel
edd at debian.org
Wed Aug 26 04:16:39 CEST 2009
Mark,
On 25 August 2009 at 20:50, Mark Mueller wrote:
| PROBLEM DEFINITION --
|
| Host environment:
|
| - AMD_64, 4xCPU, quad core
| - Ubuntu 9.04 64-bit
| - OpenMPI 1.3.2 (to avoid the problem in v1.3 where OpenMPI tries to connect
| to the localhost via ssh to run local jobs) - manually downloaded source and
| compiled
| - Rmpi 0.5-7
| - TM 0.4
| - Snow 0.3-3
| - R 2.9.0
|
| When executing the following command on the host:
|
| $ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R
|
| the following results, yet the <some program>.R completes successfully:
|
| "mpirun has exited due to process rank 0 with PID [some pid] on node
| [node name here] exiting without calling "finalize". This may have
| caused other processes in the application to be terminated by signals
| sent by mpirun (as reported here)."
As I recall, something changed between OpenMPI 1.2.* and 1.3.* so that it now
expects jobs to end with a call to mpi.quit(). Witness this quick example:
edd at ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
Hello, rank 1 size 2 on ron
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 19867 on
node ron exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Hello, rank 0 size 2 on ron
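The mpiHelloWorld.r itself is not shown in the thread; a minimal sketch of such
a script (assuming a littler or Rscript shebang and plain Rmpi on the world
communicator, comm 0) would be something like:

  #!/usr/bin/env r
  ## loading Rmpi initialises MPI for this process
  library(Rmpi)
  ## report this process's rank and the communicator size on MPI_COMM_WORLD
  cat("Hello, rank", mpi.comm.rank(0),
      "size", mpi.comm.size(0),
      "on", mpi.get.processor.name(), "\n")

Note that there is no mpi.quit() at the end, which is what triggers the
warning above even though the script itself ran fine.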
But if I put an mpi.quit() as the last instruction in the script, all is well:
edd at ron:/tmp$ echo "mpi.quit()" >> mpiHelloWorld.r
edd at ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
Hello, rank 0 size 2 Hello, rankon ron
1 size 2 on ron
edd at ron:/tmp$
As an aside, you may like using littler (sudo apt-get install littler) or
Rscript for your scripts instead of the old-school R CMD BATCH.
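With Rscript, for example, the invocation from above would become something
like (keeping your placeholders):

  $ mpirun --hostfile <some file> -np 1 Rscript <some program>.R

and with littler you would mark the script executable, give it a
#!/usr/bin/env r shebang, and call it directly, as in the mpiHelloWorld.r run
above.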
| CONFIGURATION STEPS TAKEN --
|
| - The hostfile does not create a situation where the system is
| oversubscribed. In this case, slots=4 and max-slots=5.
|
| - The <some program>.R uses snow::activateCluster() and
| snow::deactivateCluster() in the appropriate places. There are no
| other code elements that control MPI in the <some program>.R file.
|
| I am suspicious that since the R + TM program completes successfully,
| there is something in the Rmpi/Snow/OpenMPI layer that is not cleaning
| up the MPI environment properly. This is problematic because any
Good diagnosis -- you almost got to mpi.quit()!
As an aside, I really like running a simple helloWorld.r program just to
ensure that the setup is right. Small and simple, and easier to analyse.
| shell scripts that issue the mpirun directive will capture an exit
| status of 1 (i.e. an "error") from the mpirun command, yet there does
| not seem to be anything present in the environment that would cause
| mpirun (OpenMPI) to encounter an error condition. This "clouds" the
| successful exit status from the R CMD BATCH command.
|
| Are there any known aspects of these packages that have not fully
| implemented a complete cleanup routine for MPI implementations using
| OpenMPI?
I can't tell whether tm needs that or whether your calling script needs it --
but try adding the mpi.quit() and see if that helps.
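For reference, a sketch of how a snow-over-Rmpi script usually ends; the
cluster setup shown here (makeMPIcluster() sized from the MPI universe) is
illustrative and not taken from your program:

  library(Rmpi)
  library(snow)

  ## create a snow cluster on top of Rmpi; the size here is an assumption
  cl <- makeMPIcluster(mpi.universe.size() - 1)

  ## ... parallel work via snow / tm goes here ...

  ## shut the workers down, then have Rmpi call MPI_Finalize and quit R
  stopCluster(cl)
  mpi.quit()

stopCluster() tears the workers down, and mpi.quit() finalizes MPI before
exiting R, which is exactly the step mpirun is complaining about.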
Cheers, Dirk
| Any insight or assistance will be greatly appreciated.
|
| Sincerely,
| Mark
|
| _______________________________________________
| R-sig-hpc mailing list
| R-sig-hpc at r-project.org
| https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
--
Three out of two people have difficulties with fractions.