[R-sig-hpc] snow, open mpi and "Connection to lifeline [[62364, 0], 0] lost"

Jonathan Greenberg greenberg at ucdavis.edu
Tue Jul 6 19:11:55 CEST 2010


I've been successfully running snow via Rmpi in some small tests for a
big modeling run I'm working towards. Now that I've finally started
the full run, I get the following error (after some time and several
snow calls), which quits out of R entirely:

[HOSTNAME:19967] [[62364,1],0] routed:binomial: Connection to lifeline
[[62364,0],0] lost

This is on a Debian x64 system with 4 CPUs.

I've tried both a Debian install of openmpi plus a standard
install.packages("snow"), and the "sudo apt-get install r-cran-rmpi"
approach to running snow via Rmpi, and I get the same behavior either
way.  The code and input data are too complex to paste in here, but
here are a couple of things that might be hints to what is going on:

1) I am currently initiating a snow/Rmpi cluster a SINGLE time via:
cl <- makeCluster(8, type = "MPI")
[some main loop]
[main body of code which has looping, lots of calls to cl using
clusterMap and clusterApplyLB]
[end main loop]

2) Many times, my call to the cluster requests 7 "slaves".  I
configured the makeCluster statement with 8 slaves (2 x the number of
physical processors).  I wanted to "overload" the processors because
the slaves in a single clusterApplyLB() call are likely to take
unequal amounts of time.

3) This process makes it through several more or less identical loops
before I get the "Connection to lifeline [[X,Y],Z] lost" error.
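
A minimal sketch of the structure described in 1) and 2), since the
real code is too complex to post. The names slow_task, add_pair,
run_main_loop and run_on_mpi are hypothetical stand-ins, and the loop
and task counts are made up for illustration:

```r
library(snow)  # assumes snow and Rmpi are installed and Open MPI works

# Hypothetical stand-in for the real per-task work; tasks take uneven time.
slow_task <- function(i) {
  Sys.sleep(runif(1, 0, 0.1))
  i^2
}

# Hypothetical stand-in for a function applied pairwise via clusterMap.
add_pair <- function(x, y) x + y

# Main loop: every pass reuses the same cluster object and sends it
# n_tasks (here 7) pieces of work.
run_main_loop <- function(cl, n_iter = 3, n_tasks = 7) {
  results <- vector("list", n_iter)
  for (iter in seq_len(n_iter)) {
    # Load-balanced apply: a free worker picks up the next task.
    balanced <- clusterApplyLB(cl, seq_len(n_tasks), slow_task)
    # Element-wise map over two argument lists.
    mapped <- clusterMap(cl, add_pair, balanced, seq_len(n_tasks))
    results[[iter]] <- unlist(mapped)
  }
  results
}

# The cluster is created a single time, outside the loop:
# 8 workers on a 4-CPU machine, as described in 2).
run_on_mpi <- function() {
  cl <- makeCluster(8, type = "MPI")
  on.exit(stopCluster(cl))
  run_main_loop(cl)
}
```

Calling run_on_mpi() corresponds to one full run; the error shows up
only after several passes through a loop of this shape.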

Thoughts?  Thanks!  I saw some mention in related errors about
configuring a host file, but it was unclear a) how this helps me, b)
what I need to put in the file, and c) how I get R to "read" the
file (since it appears that R does not use this hostfile when firing
up MPI).


