[R-sig-hpc] difficulty spawning Rslaves
rdiaz02 at gmail.com
Wed Dec 30 12:02:08 CET 2009
This might not provide any useful info, but just in case: when running
Rmpi, a bunch of log files are temporarily created in the current
working directory. Sometimes, they contain a little bit more info than
"process in local group is dead".
And a couple of other checks:
1. After installing release 7.1.2, you of course recompiled Rmpi
against the new versions?
2. Before running R, do your usual lamboot routine and then:
2.1. lamexec C hostname
2.2. tping C N -c 2 (or any other number after -c)
3. Inside Rmpi, why do you use mpi.comm.free instead of just
mpi.close.Rslaves? For me, for instance, the following works reliably:
mpi.spawn.Rslaves(nslaves = 1)
mpi.close.Rslaves()
mpi.spawn.Rslaves(nslaves = 4)
mpi.close.Rslaves()
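The checks in step 2 can be wrapped in a small script that is run before starting R. The commands are the ones given above; the check_lam wrapper and the guard for a machine without LAM installed are illustrative additions, and the boot step assumes your usual lamboot arguments.

```shell
# Sanity-check the LAM runtime before starting R.
check_lam() {
    # Skip cleanly when the LAM tools are not installed on this machine.
    if ! command -v lamboot >/dev/null 2>&1; then
        echo "LAM tools not on PATH; skipping checks"
        return 0
    fi
    lamboot                # your usual lamboot routine goes here
    lamexec C hostname     # every node should echo its hostname
    tping C N -c 2         # send 2 pings to all nodes (any count after -c works)
}
check_lam
```

If lamexec or tping hangs or reports an unreachable node, the problem is in the LAM layer rather than in Rmpi itself.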
On Tue, Dec 29, 2009 at 3:37 PM, Allan Strand <stranda at cofc.edu> wrote:
> Thanks Dirk and Ramon.
> I tried Lam 7.1.2 and am still seeing the same type of behavior. Still
> searching for a solution, and will report back.
> On 12/28/2009 12:28 PM, Ramon Diaz-Uriarte wrote:
>> More along Dirk's comments: we currently have two clusters using LAM,
>> both Debian systems, one running LAM release 7.1.2 and the other
>> 7.1.1. On a current Ubuntu-based laptop, things are working with
>> release 7.1.2.
>> On Mon, Dec 28, 2009 at 5:14 PM, Dirk Eddelbuettel<edd at debian.org> wrote:
>>> On 23 December 2009 at 16:05, Allan Strand wrote:
>>> | My setup is on a cluster running 64-bit FC. I have recently broken my
>>> | install of Rmpi (and hence snow) by upgrading some very old versions of R,
>>> | lam/mpi, Rmpi, and snow (currently installed versions listed at the
>>> | bottom of this email). No doubt this is a problem with my Rmpi setup,
>>> | but I'm having trouble seeing it.
>>> | I cannot seem to spawn more than a single slave (which is spawned on
>>> | the master node)
>>> | e.g.:
>>> |> mpi.spawn.Rslaves(comm=1,nslaves=1)
>>> | 1 slaves are spawned successfully. 0 failed.
>>> | master (rank 0, comm 1) of size 2 is running on: node0
>>> | slave1 (rank 1, comm 1) of size 2 is running on: node0
>>> |> mpi.comm.free(comm=1)
>>> |  1
>>> |> mpi.spawn.Rslaves(comm=1,nslaves=2)
>>> | 2 slaves are spawned successfully. 0 failed.
>>> | Error in mpi.intercomm.merge(intercomm, 0, comm) :
>>> | MPI_Error_string: process in local group is dead
>>> | No doubt the answer is contained in the MPI_Error string, but I'm not
>>> | sure how to interpret it.
>>> | Thanks,
>>> | Allan
>>> | ===================================
>>> | Versions (all installed locally in my account with directory
>>> | ./configure settings)
>>> | R 2.10.1
>>> | LAM 7.1.4/MPI 2 C++/ROMIO - Indiana University
>>> For what it is worth, a looong time ago (two years? longer?) when I was
>>> helping Manuel to get the OpenMPI packages into Debian, and when I was
>>> transitioning off LAM, I had concluded that the very latest 7.1.x
>>> releases of LAM were broken for me. The system was a then-current
>>> Ubuntu system with LAM and OpenMPI packages compiled from Debian
>>> sources. Provided I kept LAM at 7.1.2, things would work; the newer
>>> ones would not.
>>> So I'd recommend either downgrading to the last LAM that worked for you,
>>> or rather taking the plunge and jumping to Open MPI. The 1.3.* series is
>>> pretty solid already, and 1.4.0 is just around the corner.
>>> Just my $0.02. The problem may of course be entirely different.
>>> Three out of two people have difficulties with fractions.
>>> R-sig-hpc mailing list
>>> R-sig-hpc at r-project.org
> Allan Strand, Biology http://linum.cofc.edu
> College of Charleston Ph. (843) 953-9189
> Charleston, SC 29424 Fax (843) 953-9199
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
Phone: +34-91-732-8000 ext. 3019