[R-sig-hpc] snow, open mpi and "Connection to lifeline [[62364, 0], 0] lost"
Dirk Eddelbuettel
edd at debian.org
Sun Jul 11 15:31:26 CEST 2010
Jonathan,
Sorry, I meant to reply earlier, but this fell by the wayside.
On 6 July 2010 at 10:11, Jonathan Greenberg wrote:
| HPCers:
|
| I've been successfully running snow via Rmpi on some small tests I've
| been doing for this big modeling run I'm working towards, and when I
| finally initiated the run, I'm getting (after some time and several
| snow calls) the following error which quits out of R entirely:
|
| [HOSTNAME:19967] [[62364,1],0] routed:binomial: Connection to lifeline
| [[62364,0],0] lost
|
| This is on a Debian x64 system with 4 CPUs.
|
| I've used both a debian install of openmpi plus a standard
| install.packages("snow"), as well as trying out the "sudo apt-get
| install r-cran-rmpi" approaches to running snow via Rmpi, and I'm
| getting the same behavior. The code and input data are too complex to
| paste in here, but a couple of things I was thinking might be hints to
| what might be going on:
|
| 1) I am currently initiating a snow/Rmpi cluster a SINGLE time via
| cl <- makeCluster(8, type = "MPI")
| [some main loop]
| [main body of code which has looping, lots of calls to cl using
| clusterMap and clusterApplyLB]
| [end main loop]
| stopCluster(cl)
|
| 2) Many times, my call to the cluster is requesting 7 "slaves". I
| configured the makeCluster statement with 8 slaves (2 x the number of
| physical processors) -- I wanted to "overload" the processors because
| the slaves during a single clusterApplyLB() call are likely to be
| asymmetric in the amount of time they take.
|
| 3) This process makes it through several more or less identical loops
| before I get the "Connection to lifeline [[X,Y],Z] lost" error.
|
| Thoughts? Thanks! I saw some mention in related errors about
| configuring some host file, but was unclear a) how this helps me, b)
| what I need to put in the file, and c) how do I get R to "read" the
| file (since it appears that R does not use this hostfile when firing
| up MPI).
I would try to simplify. Create a C/C++-only MPI program (e.g. the hello
world example in my 'Intro to HPC with R' slides) and see if you can run
that. If so, add something to add a little load -- sum the squares of logs
of a million numbers or whatever -- and see if that works. If so, try a
simple Rmpi approach, and then go from Rmpi alone to snow on top of Rmpi.
Hope this helps.
--
Regards, Dirk