[R-sig-hpc] R <--> TM <--> Snow <--> Rmpi <--> OpenMPI cluster cleanup

Mark Mueller mark.mueller at gmail.com
Wed Aug 26 03:50:42 CEST 2009


PROBLEM DEFINITION --

Host environment:

- AMD64, 4 quad-core CPUs
- Ubuntu 9.04 64-bit
- OpenMPI 1.3.2, manually downloaded and compiled from source (to avoid
the problem in v1.3 where OpenMPI tries to connect to localhost via ssh
to run local jobs)
- Rmpi 0.5-7
- TM 0.4
- Snow 0.3-3
- R 2.9.0

When executing the following command on the host:

$ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R

the following message is produced, even though <some program>.R completes
successfully:

"mpirun has exited due to process rank 0 with PID [some pid] on node
[node name here] exiting without calling "finalize". This may have
caused other processes in the application to be terminated by signals
sent by mpirun (as reported here)."

CONFIGURATION STEPS TAKEN --

- The hostfile does not create a situation where the system is
oversubscribed.  In this case, slots=4 and max-slots=5 (a sketch of the
hostfile follows this list).

- The <some program>.R uses snow::activateCluster() and
snow::deactivateCluster() in the appropriate places.  There are no
other code elements that control MPI in the <some program>.R file.
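
For concreteness, the hostfile passed via --hostfile looks roughly like
this ("node01" is a placeholder hostname; the slot counts are the ones
stated above):

  # single-node hostfile given to mpirun via --hostfile
  node01 slots=4 max-slots=5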

Since the R + TM program itself completes successfully, I suspect that
something in the Rmpi/Snow/OpenMPI layer is not cleaning up the MPI
environment properly, i.e. MPI_Finalize is never called before R exits.
This is problematic because any shell script that issues the mpirun
command captures an exit status of 1 (an "error") from mpirun, even
though nothing in the environment appears to cause an actual error
condition.  This masks the successful exit status of the R CMD BATCH
command.
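
For illustration, this is the kind of explicit shutdown I would expect
one of these layers to perform at the end of the run.  A minimal sketch
(assuming Rmpi is already loaded, as it is once the snow MPI cluster
exists, and that calling it after the cluster has been shut down is safe):

  ## Sketch only -- not what <some program>.R currently does.
  ## ... cluster work and deactivation above ...

  library(Rmpi)
  ## Finalize MPI before R exits so that mpirun sees MPI_Finalize and
  ## stops reporting "exiting without calling finalize".
  mpi.quit(save = "no")   # finalizes MPI, then quits R

Whether this should be necessary at all, or whether one of the layers is
supposed to do it automatically, is exactly what I am unsure about.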

Is it a known issue that any of these packages does not fully implement
MPI cleanup (finalization) when running on OpenMPI?

Any insight or assistance will be greatly appreciated.

Sincerely,
Mark


