[R-sig-hpc] snow errors: cannot run slavehostinfo on slaves

Stephen Weston stephen.b.weston at gmail.com
Tue Dec 13 21:48:05 CET 2011


Hi Chris,

I'm not an MPI expert, but I've seen some problems running
snow/Rmpi scripts interactively from R.  I suggest you first
get the non-interactive case working with mpirun.

> I've also tried using mpirun to start the master and worker
> process, but this doesn't get me much farther.
>
> $ mpirun RMPISNOW  # <-- mpirun gets processor count of 2 from torque

I suggest trying to run a simple snow/Rmpi script, such as
the following, which I'll call mpi.R:

  library(snow)
  library(Rmpi)
  # spawn one fewer worker than there are MPI slots,
  # leaving one slot for the master process
  cl <- makeMPIcluster(mpi.universe.size() - 1)
  r <- clusterEvalQ(cl, R.version.string)  # evaluate on every worker
  print(unlist(r))
  stopCluster(cl)
  mpi.quit()

Note that you have to pass the number of workers to
makeMPIcluster when spawning them.  Here that count comes from
mpi.universe.size(), which returns the total number of MPI
slots available: four in this case, so subtracting one for the
master leaves three spawned workers.
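
If you're not sure how many slots your torque allocation
actually gives you, it may be worth printing the universe size
first.  Here's a quick check script (my own sketch; the max()
guard is just there so you never ask for zero workers):

  library(Rmpi)
  cat("mpi.universe.size() =", mpi.universe.size(), "\n")
  nworkers <- max(mpi.universe.size() - 1, 1)  # keep one slot for the master
  cat("that would give", nworkers, "workers\n")
  mpi.quit()

Run it the same way as mpi.R below, with 'mpirun -n 1'.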

Now run the script using mpirun:

  $ mpirun -n 1 R --slave -f mpi.R

Notice that I used '-n 1' because I only want mpirun to start
one process, which will be the master.  The rest of the
processes (the cluster workers) will be spawned by MPI
when the master calls makeMPIcluster.
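
Since you're running under torque, you'd typically put that
command in a job script.  Something along these lines should
work (a sketch only; the resource request and any module or
environment setup will depend on your site):

  #!/bin/sh
  #PBS -l nodes=2:ppn=2
  cd $PBS_O_WORKDIR
  mpirun -n 1 R --slave -f mpi.R

With nodes=2:ppn=2 you get four slots in total, matching the
example above.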

If that doesn't work, it's possible that there's a problem with
spawning workers in your MPI installation.  In that case, use
the following script, which avoids spawning: all of the
processes are started directly by mpirun.  I'll call it mpi2.R:

  library(snow)
  library(Rmpi)
  if (mpi.comm.rank(0) > 0) {
    # workers: silence output and enter the snow worker loop
    sink(file="/dev/null")
    slaveLoop(makeMPImaster())
    mpi.quit()
  }
  # only rank 0 gets past the if block and acts as the master
  cl <- makeMPIcluster()
  r <- clusterEvalQ(cl, R.version.string)
  print(unlist(r))
  stopCluster(cl)
  mpi.quit()

The extra code at the top makes every rank except 0 execute the
slaveLoop() function.  Only rank 0, which we call the master,
reaches the call to makeMPIcluster(), and since no workers need
to be spawned, makeMPIcluster() is called without a worker count.
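
If you want to verify that mpirun really starts all of the
processes before trying the full script, a bare-bones rank
report like this can help (my own diagnostic, not part of snow):

  library(Rmpi)
  cat(sprintf("rank %d of %d on %s\n",
              mpi.comm.rank(0), mpi.comm.size(0),
              Sys.info()[["nodename"]]))
  mpi.quit()

Run it without -n and you should see one line per process.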

This time, don't use the mpirun -n option, so that mpirun
starts all four processes itself.  Rank 0 becomes the master
and the rest become workers:

  $ mpirun R --slave -f mpi2.R

Hopefully one of these two approaches will work for you.

Good luck,

- Steve

On Tue, Dec 13, 2011 at 2:16 PM, Chris Berthiaume <chrisbee at uw.edu> wrote:
> I'm getting an error when I try to create an MPI cluster with more
> than 1 slave node using snow.  Hopefully somebody on the list has
> encountered this before.
>
>> cl <- makeMPIcluster(2)
>        2 slaves are spawned successfully. 0 failed.
> Error in slave.hostinfo(1) : cannot run slavehostinfo on slaves
> [compute-0-0.local:22932] [[48203,0],0]-[[48203,2],0]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [compute-0-0.local:22930] [[48203,1],0] routed:binomial: Connection to
> lifeline [[48203,0],0] lost
>
> At this point R exits.  The server this was run on is
> compute-0-0.local.  If I run makeMPIcluster(1) the single slave node
> is created successfully and can be used, for example clusterCall
> works.  I've also tried using mpirun to start the master and worker
> process, but this doesn't get me much farther.
>
> $ mpirun RMPISNOW  # <-- mpirun gets processor count of 2 from torque
> master (rank 0, comm 1) of size 2 is running on: compute-0-0
> slave1 (rank 1, comm 1) of size 2 is running on: compute-0-0
>> library(snow)
>> cl <- getMPIcluster()
>> cl
> NULL
>> cl <- makeCluster()
>> clusterCall(cl, function() Sys.info()[c("nodename")])
> ...hangs...
>
> So getMPIcluster() returns a null object, and using the object
> returned by makeCluster causes R to hang.
>
> Other maybe helpful information:
>
> - I can run MPI C code OK across multiple nodes
> - I can use Rmpi to create and use slave nodes OK
> - Using Centos 5 x86_64
> - Using Rmpi 0.5-9
> - Using snow 0.3-8
> - Using R 2.12.1
> - Using OpenMPI 1.4.4
>
> Thanks for any help with this error,
> -Chris
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc


