[R-sig-hpc] Snow Not Distributing
lists at jdadesign.net
Mon Jan 23 16:31:31 CET 2012
On 1/20/2012 5:19 PM, Brian G. Peterson wrote:
> On Fri, 2012-01-20 at 15:53 -0600, Jeff Allen wrote:
>> I have been able to successfully set up snow (0.3-5) and Rmpi (0.5-9) on
>> my RedHat 5 cluster, and have it working perfectly for jobs that don't
>> span multiple nodes.
>> We're using Torque for resource management, so I start a job with access
>> to multiple nodes and load Snow. Unfortunately, no matter what size
>> cluster I try to make, all of the workers end up running on the same
>> host -- leaving the other hosts idle.
>> I'm no expert with MPI or snow, so I'm really not sure how to approach
>> debugging this.
>> Any input would be much appreciated!
> I would suggest trying a simple 'hello world' type script just using
> Rmpi, e.g. one of the examples from Dirk's Intro to HPC with R
> presentation (I'm not somewhere I can easily search for it, but you
> should be able to find it on Dirk's site or a link in the list archives).
> - Brian
Thanks for the suggestion, Brian. I've gone through the following "Hello
World" steps and don't get the desired result.
qsub -I -l nodes=2:ppn=12 #spawn a 2 node, 12 core/node job
I then went through the following R session:
1 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 2 is running on: n004
slave1 (rank 1, comm 1) of size 2 is running on: n004
> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
 "I am 1 of 2"
I get the same results in interactive and batch (Rscript) submissions of
this job. My expectation was that it would spawn 1 worker per core (24
of which are available), or at least one worker per node (n004 and n013
were allocated by Torque for this request).
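One check that may help separate a Torque problem from an MPI problem is to count how many slots each host has in the node file. The sketch below assumes Torque exports PBS_NODEFILE inside the job and writes one line per allocated core; count_slots is a hypothetical helper name, not part of Torque, snow, or Rmpi.

```shell
#!/bin/sh
# Summarise a Torque node file as "host slots" pairs, one host per line.
# Assumption: the node file lists each host once per allocated core.
count_slots() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# PBS_NODEFILE is set by Torque inside a job; do nothing outside of one.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    count_slots "$PBS_NODEFILE"
fi
```

For the nodes=2:ppn=12 request above, two hosts with 12 slots each would be the expected shape. If both hosts show up here but every worker still starts on n004, the allocation itself is probably fine and the question becomes how the MPI runtime learns about it, which depends on how MPI was built and how R was launched.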
Let me know what additional information/output I can provide.