[R-sig-hpc] snow errors: cannot run slavehostinfo on slaves

Chris Berthiaume chrisbee at uw.edu
Tue Dec 13 22:41:06 CET 2011


Yes, I should have mentioned that I was running an interactive Torque
job, but it looks like you sussed that out from my mpirun comment.

$ qsub -lwalltime=01:00:00,nodes=2:ppn=2 -I

I'll try running the non-interactive mpirun examples you suggested and
see what I get.  Thanks for your help.

-Chris

On Tue, Dec 13, 2011 at 1:00 PM, Stephen Weston
<stephen.b.weston at gmail.com> wrote:
> I had meant to say that these examples assume that you're
> working from an interactive Torque job, which seemed to
> be your situation.  If you started the job with a command
> such as:
>
>  $ qsub -I -l nodes=4 -q devel
>
> then you should get four slots allocated, and mpirun
> will default to starting four processes.  That's why you
> need to use '-n 1' for the spawn case, but don't need
> to use -n for the non-spawn case.
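>
> If you want to sanity-check the slot count before involving R at
> all, running a plain command under mpirun should start one copy
> per slot (this assumes a standard Open MPI setup):
>
>  $ mpirun hostname
>
> With four slots allocated you should see four hostnames printed.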
>
> Sorry for any confusion,
>
> - Steve
>
>
> On Tue, Dec 13, 2011 at 3:48 PM, Stephen Weston
> <stephen.b.weston at gmail.com> wrote:
>> Hi Chris,
>>
>> I'm not an MPI expert, but I've seen some problems running
>> snow/Rmpi scripts interactively from R.  I suggest that you work
>> to get the non-interactive case working using mpirun.
>>
>>> I've also tried using mpirun to start the master and worker
>>> process, but this doesn't get me much farther.
>>>
>>> $ mpirun RMPISNOW  # <-- mpirun gets processor count of 2 from torque
>>
>> I suggest trying to run a simple snow/Rmpi script, such as
>> the following, which I'll call mpi.R:
>>
>>  library(snow)
>>  library(Rmpi)
>>  # one worker per MPI slot, minus one slot reserved for the master
>>  cl <- makeMPIcluster(mpi.universe.size() - 1)
>>  r <- clusterEvalQ(cl, R.version.string)
>>  print(unlist(r))
>>  stopCluster(cl)
>>  mpi.quit()
>>
>> Note that you have to specify the number of workers to
>> makeMPIcluster when spawning them.  The script uses
>> mpi.universe.size(), which will return four in this case, so
>> three workers are spawned (one is subtracted to leave a slot
>> for the master).
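>>
>> If you want to confirm what mpi.universe.size() reports before
>> running the whole script, a quick one-liner along these lines
>> should print it (assuming your R accepts the -e option and Rmpi
>> loads cleanly under mpirun):
>>
>>  $ mpirun -n 1 R --slave -e 'library(Rmpi); cat(mpi.universe.size(), "\n"); mpi.quit()'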
>>
>> Now run the script using mpirun:
>>
>>  $ mpirun -n 1 R --slave -f mpi.R
>>
>> Notice that I used '-n 1' because I only want mpirun to start
>> one process, which will be the master.  The rest of the
>> processes (the cluster workers) will be spawned by MPI
>> when the master calls makeMPIcluster.
>>
>> If that doesn't work, it's possible that there's a problem with
>> spawning workers in your MPI installation.  Instead, use the
>> following script which doesn't use spawning.  All of the
>> processes are started by mpirun.  I'll call it mpi2.R:
>>
>>  library(snow)
>>  library(Rmpi)
>>  # every rank except 0 becomes a worker and never gets past mpi.quit()
>>  if (mpi.comm.rank(0) > 0) {
>>    sink(file="/dev/null")  # silence worker output
>>    slaveLoop(makeMPImaster())
>>    mpi.quit()
>>  }
>>  # only rank 0 (the master) reaches this point
>>  cl <- makeMPIcluster()
>>  r <- clusterEvalQ(cl, R.version.string)
>>  print(unlist(r))
>>  stopCluster(cl)
>>  mpi.quit()
>>
>> The extra code at the top makes every rank except rank 0 execute
>> the slaveLoop() function.  Only rank 0, the master, reaches
>> makeMPIcluster(), and note that it is called without a worker
>> count, since no workers are spawned.
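>>
>> If you want to see how the ranks are assigned before trying the
>> full script, a stripped-down check like this (just Rmpi, no snow;
>> call it, say, ranks.R) should print one line per process when run
>> as 'mpirun R --slave -f ranks.R':
>>
>>  library(Rmpi)
>>  cat("rank", mpi.comm.rank(0), "of", mpi.comm.size(0), "\n")
>>  mpi.quit()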
>>
>> This time, you don't use the mpirun -n option, so that mpirun
>> will start four processes in this case.  Rank 0 will become the
>> master and the rest will be workers:
>>
>>  $ mpirun R --slave -f mpi2.R
>>
>> Hopefully one of these two approaches will work for you.
>>
>> Good luck,
>>
>> - Steve
>>
>> On Tue, Dec 13, 2011 at 2:16 PM, Chris Berthiaume <chrisbee at uw.edu> wrote:
>>> I'm getting an error when I try to create an MPI cluster with more
>>> than 1 slave node using snow.  Hopefully somebody on the list has
>>> encountered this before.
>>>
>>>> cl <- makeMPIcluster(2)
>>>        2 slaves are spawned successfully. 0 failed.
>>> Error in slave.hostinfo(1) : cannot run slavehostinfo on slaves
>>> [compute-0-0.local:22932] [[48203,0],0]-[[48203,2],0]
>>> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>> [compute-0-0.local:22930] [[48203,1],0] routed:binomial: Connection to
>>> lifeline [[48203,0],0] lost
>>>
>>> At this point R exits.  The server this was run on is
>>> compute-0-0.local.  If I run makeMPIcluster(1), the single slave
>>> node is created successfully and can be used; clusterCall works,
>>> for example.  I've also tried using mpirun to start the master and
>>> worker processes, but this doesn't get me much farther.
>>>
>>> $ mpirun RMPISNOW  # <-- mpirun gets processor count of 2 from torque
>>> master (rank 0, comm 1) of size 2 is running on: compute-0-0
>>> slave1 (rank 1, comm 1) of size 2 is running on: compute-0-0
>>>> library(snow)
>>> library(snow)
>>>> cl <- getMPIcluster()
>>> cl <- getMPIcluster()
>>>> cl
>>> cl
>>> NULL
>>>> cl <- makeCluster()
>>> cl <- makeCluster()
>>>> clusterCall(cl, function() Sys.info()[c("nodename")])
>>> clusterCall(cl, function() Sys.info()[c("nodename")])
>>> ...hangs...
>>>
>>> So getMPIcluster() returns a null object, and using the object
>>> returned by makeCluster causes R to hang.
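>>>
>>> For what it's worth, a guard like this at the top of the script
>>> would at least fail fast instead of hanging (untested sketch):
>>>
>>>  library(snow)
>>>  cl <- getMPIcluster()  # should have been registered by RMPISNOW
>>>  if (is.null(cl))
>>>    stop("RMPISNOW did not register an MPI cluster")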
>>>
>>> Other possibly helpful information:
>>>
>>> - I can run MPI C code OK across multiple nodes
>>> - I can use Rmpi to create and use slave nodes OK
>>> - Using CentOS 5 x86_64
>>> - Using Rmpi 0.5-9
>>> - Using snow 0.3-8
>>> - Using R 2.12.1
>>> - Using OpenMPI 1.4.4
>>>
>>> Thanks for any help with this error,
>>> -Chris
>>>


