[R-sig-hpc] snow errors: cannot run slavehostinfo on slaves

Chris Berthiaume chrisbee at uw.edu
Tue Dec 13 20:16:30 CET 2011


I'm getting an error when I try to create an MPI cluster with more
than 1 slave node using snow.  Hopefully somebody on the list has
encountered this before.

> cl <- makeMPIcluster(2)
        2 slaves are spawned successfully. 0 failed.
Error in slave.hostinfo(1) : cannot run slavehostinfo on slaves
[compute-0-0.local:22932] [[48203,0],0]-[[48203,2],0]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[compute-0-0.local:22930] [[48203,1],0] routed:binomial: Connection to
lifeline [[48203,0],0] lost

At this point R exits.  This was run on compute-0-0.local.  If I run
makeMPIcluster(1), the single slave node is created successfully and
can be used; for example, clusterCall works.  I've also tried using
mpirun to start the master and worker processes, but this doesn't get
me much further.

$ mpirun RMPISNOW  # <-- mpirun gets processor count of 2 from torque
master (rank 0, comm 1) of size 2 is running on: compute-0-0
slave1 (rank 1, comm 1) of size 2 is running on: compute-0-0
> library(snow)
library(snow)
> cl <- getMPIcluster()
cl <- getMPIcluster()
> cl
cl
NULL
> cl <- makeCluster()
cl <- makeCluster()
> clusterCall(cl, function() Sys.info()[c("nodename")])
clusterCall(cl, function() Sys.info()[c("nodename")])
...hangs...

So getMPIcluster() returns NULL, and calling clusterCall on the object
returned by makeCluster() causes R to hang.

Other information that may be helpful:

- I can run MPI C code OK across multiple nodes
- I can use Rmpi to create and use slave nodes OK
- Using CentOS 5 x86_64
- Using Rmpi 0.5-9
- Using snow 0.3-8
- Using R 2.12.1
- Using OpenMPI 1.4.4

Thanks for any help with this error,
-Chris


