[R-sig-hpc] Snow Not Distributing

Thu Jan 26 20:05:49 CET 2012

THANK YOU! I followed your summary and was able to produce the output 
included below.

I updated the reference to the hostfile, as you suggested, and made sure
that -n was set to 1.

I really can't thank you enough for posting this -- saved me a TON of time.

Jeff

[jalle6 at qbri ecoli]$ qsub -I -l nodes=2:ppn=12
qsub: waiting for job 6839.qbri to start
qsub: job 6839.qbri ready

[jalle6 at n010 ~]$ ./Temp/shmpi.sh
         24 slaves are spawned successfully. 0 failed.
master  (rank 0 , comm 1) of size 25 is running on: n010
slave1  (rank 1 , comm 1) of size 25 is running on: n010
slave2  (rank 2 , comm 1) of size 25 is running on: n010
slave3  (rank 3 , comm 1) of size 25 is running on: n010
... ... ...
slave23 (rank 23, comm 1) of size 25 is running on: n001
slave24 (rank 24, comm 1) of size 25 is running on: n010
$slave1
[1] "n010"

$slave2
[1] "n010"

$slave3
[1] "n010"

$slave4
[1] "n010"

$slave5
[1] "n010"

$slave6
[1] "n010"

$slave7
[1] "n010"

$slave8
[1] "n010"

$slave9
[1] "n010"

$slave10
[1] "n010"

$slave11
[1] "n010"

$slave12
[1] "n001"

$slave13
[1] "n001"

$slave14
[1] "n001"

$slave15
[1] "n001"

$slave16
[1] "n001"

$slave17
[1] "n001"

$slave18
[1] "n001"

$slave19
[1] "n001"

$slave20
[1] "n001"

$slave21
[1] "n001"

$slave22
[1] "n001"

$slave23
[1] "n001"

$slave24
[1] "n010"

[1] 1
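
To confirm the split at a glance, the per-worker hostnames printed above can
be tallied per node. This is only a minimal sketch: it assumes the output has
the form shown above (lines such as [1] "n010"), and tally_hosts is a made-up
helper name, not part of snow or Rmpi.

```shell
#!/bin/sh
# Tally how many snow workers reported each hostname.
# Input on stdin: R's printed list of hostnames, i.e. lines like  [1] "n010"
# (tally_hosts is a hypothetical helper, not part of snow or Rmpi.)
tally_hosts() {
    # Match only lines of the form [1] "host"; the text between the
    # double quotes is field 2 when splitting on the quote character.
    awk -F'"' '/^\[1\] "/ { count[$2]++ }
               END { for (h in count) print h, count[h] }' | sort
}

# Example use inside the interactive job:
#   ./Temp/shmpi.sh | tally_hosts
```

For the listing above this should report 12 workers on n001 and 12 on n010;
the final [1] 1 line is ignored because it contains no quoted hostname.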

On 1/26/2012 11:34 AM, Paul Johnson wrote:
> On Fri, Jan 20, 2012 at 3:53 PM, Jeff Allen<lists at jdadesign.net>  wrote:
>> I have been able to successfully set up snow (0.3-5) and Rmpi (0.5-9) on my
>> RedHat 5 cluster, and have it working perfectly for jobs that don't span
>> multiple nodes.
>>
>> We're using Torque for resource management, so I start a job with access to
>> multiple nodes and load Snow. Unfortunately, no matter what size cluster I
>> try to make, all of the workers end up running on the same host -- leaving
>> the other hosts idle.
> Have you solved the problem yet?  If not, I can help. I have exactly
> your setup and I have been through EXACTLY the same problems you are
> seeing.
>
> I've been developing a collection of Rmpi programs that actually work,
> some with Snow, some with parallel.
>
> This is the cluster main page
>
> http://web.ku.edu/~quant/cgi-bin/mw1/index.php?title=Cluster:Main
>
> and about 2/3 down, you see a link to my collection of working programs.
>
> That is an SVN repo that has http access
>
> http://winstat.quant.ku.edu/svn/hpcexample/trunk
>
> In case you are impatient, here is what I suggest.  This should be
> your submission script. I mean this works for us.
>
> #!/bin/sh
> #
> #This is an example script example.sh
> #
> #These commands set up the Grid Environment for your job:
> #PBS -N SnowHelloWorld
> #PBS -l nodes=11:ppn=1
> #PBS -l walltime=00:50:00
> #PBS -M pauljohn at ku.edu
> #PBS -m bea
>
> cd $PBS_O_WORKDIR
>
> ### This RUNS, and because I give it a machine list, it uses them.
> orterun --hostfile $PBS_NODEFILE -n 1 R --no-save --vanilla -f snow-hello.R
>
> ###############################
>
> Note that in the orterun command (same as mpirun) I am ONLY REQUESTING
> one process. We let R do the spawning of the workers. The PBS command asks
> for 11 nodes.
>
> Then the job for snow-hello.R creates the cluster.
>
> Why am I pasting this in? I'm crazy. Just go look here for the sub
> script, the program, an explanation, and example output.
>
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex60-HelloWorldSnow/
>
>> I'm no expert with MPI or snow, so I'm really not sure how to approach
>> debugging this.
>>
>> Any input would be much appreciated!
>>
>> Jeff
>
>
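
A note for anyone reproducing the quoted setup: Torque writes one hostname per
allocated processor slot to the file named by $PBS_NODEFILE, and that is the
file the orterun --hostfile option is pointed at. Before starting R it can
help to check that the file really lists more than one host. A minimal sketch,
assuming standard sort, uniq and awk are available; summarize_nodefile is a
hypothetical helper name:

```shell
#!/bin/sh
# Summarize a Torque node file (one hostname per line, one line per slot).
# Prints "host slots" for each distinct host.
# (summarize_nodefile is a hypothetical helper, not a Torque command.)
summarize_nodefile() {
    # $1: path to the node file; inside a job this would be "$PBS_NODEFILE"
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Example use inside a job:
#   summarize_nodefile "$PBS_NODEFILE"
```

With a request such as nodes=2:ppn=12 one would expect two hosts with 12 slots
each. If only one host shows up, the problem lies in the allocation rather
than in snow or Rmpi.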