[R-sig-hpc] snow, open mpi and "Connection to lifeline [[62364, 0], 0] lost"

Dirk Eddelbuettel edd at debian.org
Sun Jul 11 15:31:26 CEST 2010


Jonathan,

Sorry, I meant to reply earlier but this fell by the wayside.

On 6 July 2010 at 10:11, Jonathan Greenberg wrote:
| HPCers:
| 
| I've been successfully running snow via Rmpi on some small tests I've
| been doing for this big modeling run I'm working towards, and when I
| finally initiated the run, I'm getting (after some time and several
| snow calls) the following error which quits out of R entirely:
| 
| [HOSTNAME:19967] [[62364,1],0] routed:binomial: Connection to lifeline
| [[62364,0],0] lost
| 
| This is on a Debian x64 system with 4 CPUs.
| 
| I've used both a debian install of openmpi plus a standard
| install.packages("snow"), as well as trying out the "sudo apt-get
| install r-cran-rmpi" approaches to running snow via Rmpi, and I'm
| getting the same behavior.  The code and input data are too complex to
| paste in here, but a couple of things I was thinking might be hints to
| what might be going on:
| 
| 1) I am currently initiating a snow/Rmpi cluster a SINGLE time via:
| cl <- makeCluster(8, type = "MPI")
| [some main loop]
| [main body of code which has looping, lots of calls to cl using
| clusterMap and clusterApplyLB]
| [end main loop]
| stopCluster(cl)
| 
| 2) Many times, my call to the cluster is requesting 7 "slaves".  I
| configured the makeCluster statement with 8 slaves (2 x the number of
| physical processors) -- I wanted to "overload" the processors because
| the slaves in a single clusterApplyLB() call are likely to take
| unequal amounts of time.
| 
| 3) This process makes it through several more or less identical loops
| before I get the "Connection to lifeline [[X,Y],Z] lost" error.
| 
| Thoughts?  Thanks!  I saw some mention in related errors about
| configuring some host file, but was unclear on a) how this helps me, b)
| what I need to put in the file, and c) how I get R to "read" the
| file (since it appears that R does not use this hostfile when firing
| up MPI).

I would try to simplify. Create a C/C++-only MPI program (e.g. the hello
world example in my 'Intro to HPC with R' slides) and see if you can run
that.  If so, add a little load -- sum the squares of the logs of a million
numbers or whatever -- and see if that still works. If so, try a simple
Rmpi-only script, and only then go from Rmpi alone to snow on top of Rmpi.

Hope this helps.

-- 
  Regards, Dirk


