[R-sig-hpc] difficulty spawning Rslaves

Ramon Diaz-Uriarte rdiaz02 at gmail.com
Wed Dec 30 18:41:42 CET 2009


Hmmm... this is already beyond my very limited understanding of LAM/MPI.

I guess that booting the LAM universe with at least, say, 2 slaves
(i.e., setting that in the config file), and then starting R and
running directly

library(Rmpi)
mpi.spawn.Rslaves(nslaves = 2)

also fails?
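
For concreteness, the boot step I mean would look roughly like this (the
hostfile name, hostnames, and cpu counts are only placeholders for your
own setup):

# ~/lamhosts -- an example LAM boot schema
node0 cpu=2
node1 cpu=2

lamboot -v ~/lamhosts    # boot the LAM universe from that schema
lamnodes                 # sanity check: list the booted nodes

and then, in R:

library(Rmpi)
mpi.universe.size()             # should reflect the CPUs you just booted
mpi.spawn.Rslaves(nslaves = 2)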


Best,

R.



On Wed, Dec 30, 2009 at 5:33 PM, Allan Strand <stranda at cofc.edu> wrote:
> Hi Ramon,
> Still having the problem.
>
> LAM definitely seems to be working (step 2 below succeeds).  I've also
> recompiled/reinstalled Rmpi (several to many times).  As for mpi.comm.free
> usage, the truth is that when things are working I only use the snow
> interface, so I really have little facility with Rmpi.  Will use
> mpi.close.Rslaves() now.
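>
> (For reference, the snow calls I use when things do work are basically
> these; the cluster size and the test call are only an illustration:)
>
> library(snow)
> cl <- makeCluster(2, type = "MPI")                    # spawns Rmpi slaves underneath
> clusterCall(cl, function() Sys.info()[["nodename"]])  # where did each slave land?
> stopCluster(cl)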
>
> This error seems to be the problem: I stepped manually through
> mpi.spawn.Rslaves(), and everything succeeds until the call to
> mpi.intercomm.merge:
>
> Error in mpi.intercomm.merge(intercomm, 0, comm) :
> |    MPI_Error_string: process in local group is dead
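>
> (To replay those steps interactively, one can print the function body and
> evaluate it call by call -- nothing Rmpi-specific, just the usual R trick:)
>
> library(Rmpi)
> body(mpi.spawn.Rslaves)   # lists the individual calls, including the
>                           # mpi.intercomm.merge() that fails here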
>
> Still looking,
> cheers,
> a.
>
> On 12/30/2009 06:02 AM, Ramon Diaz-Uriarte wrote:
>>
>> This might not provide any useful info, but just in case: when running
>> Rmpi, a bunch of log files are temporarily created in the current
>> working directory. Sometimes, they contain a little bit more info than
>> "process in local group is dead".
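>>
>> (A quick way to spot them -- generic R, no Rmpi-specific file names
>> assumed: list the most recently modified files in the working directory.)
>>
>> info <- file.info(list.files())
>> head(info[order(info$mtime, decreasing = TRUE), "mtime", drop = FALSE])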
>>
>> And a couple of other checks:
>>
>> 1. After installing release 7.1.2, you of course recompiled Rmpi
>> against the new versions?
>>
>>
>> 2. Before running R, do your usual lamboot routine and then:
>>
>> 2.1. lamexec C hostname
>>
>> 2.2. tping C N -c 2 (or any other number after -c)
>>
>>
>> 3. Inside Rmpi, why do you use mpi.comm.free instead of just
>> mpi.close.Rslaves? For me, for instance, the following works reliably:
>>
>> library(Rmpi)
>> mpi.spawn.Rslaves(nslaves = 1)
>> mpi.close.Rslaves()
>> mpi.spawn.Rslaves(nslaves = 4)
>>
>>
>> Best,
>>
>> R.
>>
>>
>> On Tue, Dec 29, 2009 at 3:37 PM, Allan Strand <stranda at cofc.edu> wrote:
>>
>>>
>>> Thanks Dirk and Ramon.
>>>
>>> I tried LAM 7.1.2 and am still seeing the same type of behavior.  Still
>>> searching for a solution, and will report back.
>>>
>>> cheers,
>>> a.
>>>
>>> On 12/28/2009 12:28 PM, Ramon Diaz-Uriarte wrote:
>>>
>>>>
>>>> More along the lines of Dirk's comments: we currently have two clusters
>>>> using LAM, both Debian systems, one running LAM release 7.1.2 and the
>>>> other 7.1.1. On a current Ubuntu-based laptop, things are working with
>>>> release 7.1.2.
>>>>
>>>> Best,
>>>>
>>>> R.
>>>>
>>>> On Mon, Dec 28, 2009 at 5:14 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
>>>>
>>>>
>>>>>
>>>>> Allan,
>>>>>
>>>>> On 23 December 2009 at 16:05, Allan Strand wrote:
>>>>> | My setup is on a cluster running 64bit FC.  I have recently broken my
>>>>> | Rmpi install (and hence snow) by upgrading some very old versions of R,
>>>>> | lam/mpi, Rmpi, and snow (currently installed versions listed at the
>>>>> | bottom of this email).  No doubt this is a problem with my Rmpi install,
>>>>> | but I'm having trouble seeing it.
>>>>> |
>>>>> | I cannot seem to spawn more than a single slave (which is spawned on
>>>>> | the master node), e.g.:
>>>>> |
>>>>> |>    mpi.spawn.Rslaves(comm=1,nslaves=1)
>>>>> |      1 slaves are spawned successfully. 0 failed.
>>>>> | master (rank 0, comm 1) of size 2 is running on: node0
>>>>> | slave1 (rank 1, comm 1) of size 2 is running on: node0
>>>>> |
>>>>> |>    mpi.comm.free(comm=1)
>>>>> | [1] 1
>>>>> |
>>>>> |>    mpi.spawn.Rslaves(comm=1,nslaves=2)
>>>>> |      2 slaves are spawned successfully. 0 failed.
>>>>> | Error in mpi.intercomm.merge(intercomm, 0, comm) :
>>>>> |    MPI_Error_string: process in local group is dead
>>>>> |
>>>>> | No doubt the answer is contained in the MPI_Error string, but I'm not
>>>>> | sure how to interpret it.
>>>>> |
>>>>> | Thanks,
>>>>> | Allan
>>>>> | ===================================
>>>>> | Versions (all installed locally in my account with directory-appropriate
>>>>> | ./configure settings)
>>>>> |
>>>>> | R 2.10.1
>>>>> | LAM 7.1.4/MPI 2 C++/ROMIO - Indiana University
>>>>>  ^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>
>>>>> For what it is worth, a looong time ago (two years? longer?) when I was
>>>>> helping Manuel get the OpenMPI packages into Debian and when I was
>>>>> transitioning off LAM, I had concluded that the very latest 7.1.X releases
>>>>> of LAM were broken for me.  The system was a then-current Ubuntu system
>>>>> with the LAM and OpenMPI packages compiled from Debian sources.  Provided
>>>>> I froze LAM at 7.1.2, things would work; the newer ones would not.
>>>>>
>>>>> So I'd recommend either downgrading to the last LAM release that worked
>>>>> for you, or rather taking the plunge and jumping to Open MPI. The 1.3.*
>>>>> series is pretty solid already, and 1.4.0 is just around the corner.
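>>>>>
>>>>> (If you do jump to Open MPI, one sketch of pointing Rmpi at it at install
>>>>> time; the include and lib paths below are only examples and depend on
>>>>> where Open MPI lives on your system:)
>>>>>
>>>>> install.packages("Rmpi",
>>>>>     configure.args = c("--with-Rmpi-type=OPENMPI",
>>>>>                        "--with-Rmpi-include=/usr/include/openmpi",   # example path
>>>>>                        "--with-Rmpi-libpath=/usr/lib/openmpi/lib"))  # example path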
>>>>>
>>>>> Just my $0.02. The problem may of course be entirely different.
>>>>>
>>>>> Dirk
>>>>>
>>>>> --
>>>>> Three out of two people have difficulties with fractions.
>>>>>
>>>>> _______________________________________________
>>>>> R-sig-hpc mailing list
>>>>> R-sig-hpc at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Allan Strand,   Biology    http://linum.cofc.edu
>>> College of Charleston      Ph. (843) 953-9189
>>> Charleston, SC 29424       Fax (843) 953-9199
>>>
>>>
>>>
>>
>>
>>
>
> --
> Allan Strand,   Biology    http://linum.cofc.edu
> College of Charleston      Ph. (843) 953-9189
> Charleston, SC 29424       Fax (843) 953-9199
>
>



-- 
Ramon Diaz-Uriarte
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz
Phone: +34-91-732-8000 ext. 3019


