[R-sig-hpc] difficulty spawning Rslaves

Allan Strand stranda at cofc.edu
Wed Dec 30 17:33:20 CET 2009


Hi Ramon,
Still having the problem.

LAM definitely seems to be working (step 2 below succeeds).  I've also 
recompiled/reinstalled Rmpi (several times).  As for mpi.comm.free 
usage, the truth is that when things are working I only use the snow 
interface, so I really have little facility with Rmpi directly.  I'll 
use mpi.close.Rslaves() from now on.
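
When things are working, the snow side looks roughly like this (the 
worker count and the nodename check are illustrative, not my actual 
job):

```r
library(snow)

# Spawn workers over MPI (snow uses Rmpi underneath);
# the worker count of 4 is illustrative
cl <- makeCluster(4, type = "MPI")

# Check which node each worker landed on
clusterCall(cl, function() Sys.info()[["nodename"]])

# Shut the workers down cleanly
stopCluster(cl)
```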

This error seems to be the problem.  I stepped manually through 
mpi.spawn.Rslaves() and everything succeeds until the call to 
mpi.intercomm.merge:

Error in mpi.intercomm.merge(intercomm, 0, comm) :
|    MPI_Error_string: process in local group is dead

Still looking,
cheers,
a.
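
P.S. For the record, the step-2 LAM checks that do succeed here, 
spelled out as a script (the hostfile name is illustrative, and 
lamhalt is just there for completeness):

```shell
# Boot the LAM universe from the usual hostfile (name is illustrative)
lamboot -v lamhosts

# List the nodes LAM actually booted
lamnodes

# Run `hostname` on all nodes; every node should report in
lamexec C hostname

# Ping all nodes twice to confirm the daemons respond
tping N -c 2

# Tear the universe down when finished
lamhalt
```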

On 12/30/2009 06:02 AM, Ramon Diaz-Uriarte wrote:
> This might not provide any useful info, but just in case: when running
> Rmpi, a bunch of log files are temporarily created in the current
> working directory. Sometimes, they contain a little bit more info than
> "process in local group is dead".
>
> And a couple of other checks:
>
> 1. After installing release 7.1.2, you of course recompiled Rmpi
> against the new versions?
>
>
> 2. Before running R, do your usual lamboot routine and then:
>
> 2.1. lamexec C hostname
>
> 2.2. tping C N -c 2 (or any other number after -c)
>
>
> 3. Inside Rmpi, why do you use mpi.comm.free instead of just
> mpi.close.Rslaves? For me, for instance, the following works reliably:
>
> library(Rmpi)
> mpi.spawn.Rslaves(nslaves = 1)
> mpi.close.Rslaves()
> mpi.spawn.Rslaves(nslaves = 4)
>
>
> Best,
>
> R.
>
>
> On Tue, Dec 29, 2009 at 3:37 PM, Allan Strand<stranda at cofc.edu>  wrote:
>    
>> Thanks Dirk and Ramon.
>>
>> I tried LAM 7.1.2 and am still seeing the same type of behavior.  Still
>> searching for a solution, and will report back.
>>
>> cheers,
>> a.
>>
>> On 12/28/2009 12:28 PM, Ramon Diaz-Uriarte wrote:
>>      
>>> More along Dirk's comments: we currently have two clusters using LAM,
>>> both Debian systems, one using v. 7.1.2 of LAM's release and the other
>>> 7.1.1.  On a current Ubuntu-based laptop, things are working with
>>> release 7.1.2.
>>>
>>> Best,
>>>
>>> R.
>>>
>>> On Mon, Dec 28, 2009 at 5:14 PM, Dirk Eddelbuettel<edd at debian.org>    wrote:
>>>
>>>        
>>>> Allan,
>>>>
>>>> On 23 December 2009 at 16:05, Allan Strand wrote:
>>>> | My setup is on a cluster running 64-bit FC.  I have recently broken my
>>>> | Rmpi install (and hence snow) by upgrading some very old versions of R,
>>>> | lam/mpi, Rmpi, and snow (currently installed versions are listed at the
>>>> | bottom of this email).  No doubt this is a problem with my Rmpi install,
>>>> | but I'm having trouble seeing it.
>>>> |
>>>> | I cannot seem to spawn more than a single slave (which is spawned on the
>>>> | master node), e.g.:
>>>> |
>>>> |>    mpi.spawn.Rslaves(comm=1,nslaves=1)
>>>> |      1 slaves are spawned successfully. 0 failed.
>>>> | master (rank 0, comm 1) of size 2 is running on: node0
>>>> | slave1 (rank 1, comm 1) of size 2 is running on: node0
>>>> |
>>>> |>    mpi.comm.free(comm=1)
>>>> | [1] 1
>>>> |
>>>> |>    mpi.spawn.Rslaves(comm=1,nslaves=2)
>>>> |      2 slaves are spawned successfully. 0 failed.
>>>> | Error in mpi.intercomm.merge(intercomm, 0, comm) :
>>>> |    MPI_Error_string: process in local group is dead
>>>> |
>>>> | No doubt the answer is contained in the MPI_Error string, but I'm not
>>>> | sure how to interpret it.
>>>> |
>>>> | Thanks,
>>>> | Allan
>>>> | ===================================
>>>> | Versions (all installed locally in my account with directory-appropriate
>>>> | ./configure settings)
>>>> |
>>>> | R 2.10.1
>>>> | LAM 7.1.4/MPI 2 C++/ROMIO - Indiana University
>>>>   ^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>
>>>> For what it is worth, a looong time ago (two years? longer?), when I was
>>>> helping Manuel to get the OpenMPI packages into Debian and when I was
>>>> transitioning off LAM, I had concluded that the very latest 7.1.X
>>>> releases of LAM were broken for me.  The system was a then-current
>>>> Ubuntu system with the LAM and OpenMPI packages compiled from Debian
>>>> sources.  Provided I froze LAM at 7.1.2, things would work; the newer
>>>> ones would not.
>>>>
>>>> So I'd recommend either downgrading to the last LAM release that worked
>>>> for you, or rather taking the plunge and jumping to Open MPI.  The 1.3.*
>>>> series is pretty solid already, and 1.4.0 is just around the corner.
>>>>
>>>> Just my $0.02. The problem may of course be entirely different.
>>>>
>>>> Dirk
>>>>
>>>> --
>>>> Three out of two people have difficulties with fractions.
>>>>
>>>> _______________________________________________
>>>> R-sig-hpc mailing list
>>>> R-sig-hpc at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>>>
>>>>
>>>>          
>>>
>>>
>>>        
>> --
>> Allan Strand,   Biology    http://linum.cofc.edu
>> College of Charleston      Ph. (843) 953-9189
>> Charleston, SC 29424       Fax (843) 953-9199
>>
>>
>>      
>
>
>    

-- 
Allan Strand,   Biology    http://linum.cofc.edu
College of Charleston      Ph. (843) 953-9189
Charleston, SC 29424       Fax (843) 953-9199


