[R-sig-hpc] difficulty spawning Rslaves

Ramon Diaz-Uriarte rdiaz02 at gmail.com
Wed Dec 30 12:02:08 CET 2009


This might not provide any useful info, but just in case: when running
Rmpi, a bunch of log files are temporarily created in the current
working directory. Sometimes, they contain a little bit more info than
"process in local group is dead".
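
If it helps, here is a minimal sketch for dumping the tail of those log files from the shell. It assumes the logs sit in the directory where R was started and that their names end in .log; the exact names vary, so adjust the pattern to whatever you actually see. The function name show_logs is only an illustrative choice.

```shell
#!/bin/sh
# Print the last lines of every *.log file in a directory (default: current one).
# Assumption: the slave logs end in ".log"; change the pattern if yours differ.
show_logs() {
    dir="${1:-.}"
    found=0
    for f in "$dir"/*.log; do
        [ -f "$f" ] || continue   # the glob matched nothing
        found=1
        echo "==> $f <=="
        tail -n 20 "$f"
    done
    [ "$found" -eq 1 ] || echo "no .log files in $dir"
}

show_logs "${1:-.}"
```

Run it right after the failed spawn, since the files are only temporary.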

And a couple of other checks:

1. After installing release 7.1.2, you of course recompiled Rmpi
against the new version?


2. Before running R, do your usual lamboot routine and then:

2.1. lamexec C hostname

2.2. tping C N -c 2 (or any other number after -c)
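
The checks in step 2 can be wrapped in a small script so they are easy to rerun after every lamboot. This is only a sketch: the names run_check and lam_preflight are made up for illustration, and it assumes lamboot has already been run with your usual host file.

```shell
#!/bin/sh
# Pre-flight check of a LAM universe before starting R.
# Assumption: lamboot has already been run with your usual host file.

# Run one command, report PASS/FAIL, and return its exit status.
run_check() {
    echo "--- $*"
    if "$@"; then
        echo "PASS: $*"
    else
        status=$?
        echo "FAIL ($status): $*"
        return "$status"
    fi
}

# Run the two checks from step 2, stopping at the first failure.
lam_preflight() {
    if ! command -v lamexec >/dev/null 2>&1; then
        echo "lamexec not found in PATH; is LAM installed for this account?"
        return 127
    fi
    run_check lamexec C hostname &&
    run_check tping C N -c 2
}

lam_preflight || echo "pre-flight check failed; sort this out before starting R" >&2
```

If lamexec does not list every node you expect, the problem is in the LAM setup and not in Rmpi.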


3. Inside Rmpi, why do you use mpi.comm.free instead of just
mpi.close.Rslaves? For me, for instance, the following works reliably:

library(Rmpi)
mpi.spawn.Rslaves(nslaves = 1)
mpi.close.Rslaves()
mpi.spawn.Rslaves(nslaves = 4)


Best,

R.


On Tue, Dec 29, 2009 at 3:37 PM, Allan Strand <stranda at cofc.edu> wrote:
> Thanks Dirk and Ramon.
>
> I tried Lam 7.1.2 and am still seeing the same type of behavior.  Still
> searching for a solution, and will report back.
>
> cheers,
> a.
>
> On 12/28/2009 12:28 PM, Ramon Diaz-Uriarte wrote:
>>
>> More along Dirk's comments: we currently have two clusters using LAM,
>> both Debian systems, one using v. 7.1.2 of LAM's release and the other
>> 7.1.1. In a current Ubuntu-based laptop, things are working with
>> release 7.1.2.
>>
>> Best,
>>
>> R.
>>
>> On Mon, Dec 28, 2009 at 5:14 PM, Dirk Eddelbuettel<edd at debian.org>  wrote:
>>
>>>
>>> Allan,
>>>
>>> On 23 December 2009 at 16:05, Allan Strand wrote:
>>> | My setup is on a cluster running 64bit FC.  I have recently broken my
>>> | install Rmpi (and hence snow) by upgrading some very old versions of R,
>>> | lam/mpi, Rmpi, and snow (currently installed versions listed at the
>>> | bottom of this email).  No doubt this is a problem with my Rmpi install,
>>> | but I'm having trouble seeing it.
>>> |
>>> | I cannot seem to spawn more than a single slave (which is spawned on the
>>> | master node)
>>> | e.g.:
>>> |
>>> |>  mpi.spawn.Rslaves(comm=1,nslaves=1)
>>> |      1 slaves are spawned successfully. 0 failed.
>>> | master (rank 0, comm 1) of size 2 is running on: node0
>>> | slave1 (rank 1, comm 1) of size 2 is running on: node0
>>> |
>>> |>  mpi.comm.free(comm=1)
>>> | [1] 1
>>> |
>>> |>  mpi.spawn.Rslaves(comm=1,nslaves=2)
>>> |      2 slaves are spawned successfully. 0 failed.
>>> | Error in mpi.intercomm.merge(intercomm, 0, comm) :
>>> |    MPI_Error_string: process in local group is dead
>>> |
>>> | No doubt the answer is contained in the MPI_Error string, but I'm not
>>> | sure how to interpret it.
>>> |
>>> | Thanks,
>>> | Allan
>>> | ===================================
>>> | Versions (all installed locally in my account with directory appropriate
>>> | ./configure settings)
>>> |
>>> | R 2.10.1
>>> | LAM 7.1.4/MPI 2 C++/ROMIO - Indiana University
>>>  ^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> For what it is worth, a looong time ago (two years? longer?) when I was
>>> helping Manual to get the Debian OpenMPI packages into and when I was
>>> transitioning off LAM, I had concluded that the very latest 7.1.X releases of
>>> LAM were broken for me.  The system was a then-current Ubuntu system with the
>>> LAM and OpenMPI packages compiled from Debian sources.  Provided I 'frozen'
>>> LAM at 7.1.2 things would work, the newer ones would not.
>>>
>>> So I'd recommend either downgrading to the last LAM that worked for you, or
>>> rather take the plunge and jump to Open MPI. The 1.3.* series is pretty
>>> already, and 1.4.0 is just around the corner.
>>>
>>> Just my $0.02. The problem may of course be entirely different.
>>>
>>> Dirk
>>>
>>> --
>>> Three out of two people have difficulties with fractions.
>>>
>>> _______________________________________________
>>> R-sig-hpc mailing list
>>> R-sig-hpc at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>>
>>>
>>
>>
>>
>
> --
> Allan Strand,   Biology    http://linum.cofc.edu
> College of Charleston      Ph. (843) 953-9189
> Charleston, SC 29424       Fax (843) 953-9199
>
>



-- 
Ramon Diaz-Uriarte
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz
Phone: +34-91-732-8000 ext. 3019


