[R-sig-hpc] R <--> TM <--> Snow <--> Rmpi <--> OpenMPI cluster cleanup

Ross Boylan ross at biostat.ucsf.edu
Wed Aug 26 06:29:24 CEST 2009


On Tue, 2009-08-25 at 20:50 -0500, Mark Mueller wrote:
> PROBLEM DEFINITION --
> 
> Host environment:
> 
> - AMD_64, 4xCPU, quad core
> - Ubuntu 9.04 64-bit
> - OpenMPI 1.3.2, manually downloaded and compiled from source (to avoid
> the problem in v1.3 where OpenMPI tries to connect to localhost via ssh
> to run local jobs)
> - Rmpi 0.5-7
> - TM 0.4
> - Snow 0.3-3
> - R 2.9.0
> 
> When executing the following command on the host:
> 
> $ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R
> 
> the following message appears, even though <some program>.R completes
> successfully:
> 
> "mpirun has exited due to process rank 0 with PID [some pid] on node
> [node name here] exiting without calling "finalize". This may have
> caused other processes in the application to be terminated by signals
> sent by mpirun (as reported here)."
> 
> CONFIGURATION STEPS TAKEN --
> 
> - The hostfile does not create a situation where the system is
> oversubscribed.  In this case, slots=4 and max-slots=5.
> 
> - The <some program>.R uses snow::activateCluster() and
> snow::deactivateCluster() in the appropriate places.  There are no
> other code elements that control MPI in the <some program>.R file.
FWIW, I use stopCluster(getMPIcluster()) on Debian Lenny (OpenMPI 1.2),
and that seems to work.  I have a feeling stopCluster() might be an Rmpi
command rather than a snow command, even though it's a snow session;
maybe I should shift to deactivateCluster().  On the other hand, maybe
deactivateCluster() doesn't quite shut the cluster down.
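For what it's worth, here is a minimal sketch of the shutdown pattern I
mean.  The worker count and the toy clusterApply() call are only
illustrative; the point is the stopCluster()/mpi.quit() pair at the end.
mpi.quit() is Rmpi's wrapper that calls MPI_Finalize before quitting R,
which should keep mpirun from complaining about a missing finalize:

    library(Rmpi)   # snow's MPI transport sits on top of Rmpi
    library(snow)

    cl <- makeMPIcluster(4)   # spawn 4 workers; match your hostfile

    ## some illustrative parallel work
    print(clusterApply(cl, 1:4, function(i) i^2))

    stopCluster(cl)   # snow: tear the workers down
    mpi.quit()        # Rmpi: calls MPI_Finalize, then quits R cleanly

Run it the same way as before, e.g.

    mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R

If I remember right, when R is instead started through snow's RMPISNOW
script, getMPIcluster() returns the already-running cluster, and the
same two shutdown calls apply.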

The system I'm using for this is inaccessible right now, and so I can't
easily check the details.

Ross


