[R-sig-hpc] snow errors: cannot run slavehostinfo on slaves

Stephen Weston stephen.b.weston at gmail.com
Tue Dec 13 22:00:24 CET 2011


I had meant to say that these examples assume that you're
working from an interactive Torque job, which seemed to
be your situation.  If you started the job with a command
such as:

  $ qsub -I -l nodes=4 -q devel

then you should get four slots allocated, and mpirun
will default to starting four processes.  That's why you
need to use '-n 1' for the spawn case, but don't need
to use -n for the non-spawn case.
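
If you want to check how many slots you actually got, something
like the following should work from inside the job.  This is only
a sketch: it assumes Torque has set PBS_NODEFILE to a file with
one line per allocated slot, and count_slots is just a helper
name I made up for illustration.

```shell
# Count the slots Torque allocated to the current job.
# Assumption: PBS_NODEFILE points at a file with one line per slot.
# count_slots is a hypothetical helper, not part of Torque or Open MPI.
count_slots() {
  # Use the file given as an argument, else fall back to PBS_NODEFILE.
  nodefile="${1:-${PBS_NODEFILE:-}}"
  if [ -z "$nodefile" ] || [ ! -r "$nodefile" ]; then
    echo "no readable node file; are you inside a Torque job?" >&2
    return 1
  fi
  # Print the number of non-empty lines, i.e. the number of slots.
  grep -c . "$nodefile"
}
```

Run count_slots with no arguments inside the job.  For the qsub
command above I'd expect it to print 4, which should also be the
number of processes mpirun starts when you leave off -n.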

Sorry for any confusion,

- Steve


On Tue, Dec 13, 2011 at 3:48 PM, Stephen Weston
<stephen.b.weston at gmail.com> wrote:
> Hi Chris,
>
> I'm not an MPI expert, but I've seen some problems running
> snow/Rmpi scripts interactively from R.  I suggest that you work
> to get the non-interactive case working using mpirun.
>
>> I've also tried using mpirun to start the master and worker
>> process, but this doesn't get me much farther.
>>
>> $ mpirun RMPISNOW  # <-- mpirun gets processor count of 2 from torque
>
> I suggest trying to run a simple snow/Rmpi script, such as
> the following, which I'll call mpi.R:
>
>  library(snow)
>  library(Rmpi)
>  cl <- makeMPIcluster(mpi.universe.size() - 1)
>  r <- clusterEvalQ(cl, R.version.string)
>  print(unlist(r))
>  stopCluster(cl)
>  mpi.quit()
>
> Note that you have to specify the number of workers to
> makeMPIcluster when spawning the workers.  This uses the
> mpi.universe.size() function, which will return four in this
> case, resulting in three spawned workers (since I subtracted
> one from it).
>
> Now run the script using mpirun:
>
>  $ mpirun -n 1 R --slave -f mpi.R
>
> Notice that I used '-n 1' because I only want mpirun to start
> one process, which will be the master.  The rest of the
> processes (the cluster workers) will be spawned by MPI
> when the master calls makeMPIcluster.
>
> If that doesn't work, it's possible that there's a problem with
> spawning workers in your MPI installation.  In that case, try the
> following script, which doesn't use spawning: all of the
> processes are started by mpirun.  I'll call it mpi2.R:
>
>  library(snow)
>  library(Rmpi)
>  if (mpi.comm.rank(0) > 0) {
>    sink(file="/dev/null")
>    slaveLoop(makeMPImaster())
>    mpi.quit()
>  }
>  cl <- makeMPIcluster()
>  r <- clusterEvalQ(cl, R.version.string)
>  print(unlist(r))
>  stopCluster(cl)
>  mpi.quit()
>
> Some extra code is used to make everyone but rank 0 execute the
> slaveLoop() function.  Only rank 0, which we call the master,
> actually calls makeMPIcluster(), and it should call
> makeMPIcluster() without a worker count.
>
> This time, leave off the mpirun -n option so that mpirun
> starts all four processes.  Rank 0 will become the
> master and the rest will be workers:
>
>  $ mpirun R --slave -f mpi2.R
>
> Hopefully one of these two approaches will work for you.
>
> Good luck,
>
> - Steve
>
> On Tue, Dec 13, 2011 at 2:16 PM, Chris Berthiaume <chrisbee at uw.edu> wrote:
>> I'm getting an error when I try to create an MPI cluster with more
>> than 1 slave node using snow.  Hopefully somebody on the list has
>> encountered this before.
>>
>>> cl <- makeMPIcluster(2)
>>        2 slaves are spawned successfully. 0 failed.
>> Error in slave.hostinfo(1) : cannot run slavehostinfo on slaves
>> [compute-0-0.local:22932] [[48203,0],0]-[[48203,2],0]
>> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [compute-0-0.local:22930] [[48203,1],0] routed:binomial: Connection to
>> lifeline [[48203,0],0] lost
>>
>> At this point R exits.  The server this was run on is
>> compute-0-0.local.  If I run makeMPIcluster(1) the single slave node
>> is created successfully and can be used, for example clusterCall
>> works.  I've also tried using mpirun to start the master and worker
>> process, but this doesn't get me much farther.
>>
>> $ mpirun RMPISNOW  # <-- mpirun gets processor count of 2 from torque
>> master (rank 0, comm 1) of size 2 is running on: compute-0-0
>> slave1 (rank 1, comm 1) of size 2 is running on: compute-0-0
>>> library(snow)
>> library(snow)
>>> cl <- getMPIcluster()
>> cl <- getMPIcluster()
>>> cl
>> cl
>> NULL
>>> cl <- makeCluster()
>> cl <- makeCluster()
>>> clusterCall(cl, function() Sys.info()[c("nodename")])
>> clusterCall(cl, function() Sys.info()[c("nodename")])
>> ...hangs...
>>
>> So getMPIcluster() returns a null object, and using the object
>> returned by makeCluster causes R to hang.
>>
>> Other maybe helpful information:
>>
>> - I can run MPI C code OK across multiple nodes
>> - I can use Rmpi to create and use slave nodes OK
>> - Using CentOS 5 x86_64
>> - Using Rmpi 0.5-9
>> - Using snow 0.3-8
>> - Using R 2.12.1
>> - Using OpenMPI 1.4.4
>>
>> Thanks for any help with this error,
>> -Chris
>>
>> _______________________________________________
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc


