[R-sig-hpc] Snow Not Distributing
Jeff Allen
lists at jdadesign.net
Thu Jan 26 20:05:49 CET 2012
THANK YOU! I followed your summary and was able to produce the output
included below.
I updated the reference to the hostfile, as you suggested, and made sure that
orterun's -n was set to just 1.
I really can't thank you enough for posting this -- saved me a TON of time.
Jeff
[jalle6 at qbri ecoli]$ qsub -I -l nodes=2:ppn=12
qsub: waiting for job 6839.qbri to start
qsub: job 6839.qbri ready
[jalle6 at n010 ~]$ ./Temp/shmpi.sh
24 slaves are spawned successfully. 0 failed.
master (rank 0 , comm 1) of size 25 is running on: n010
slave1 (rank 1 , comm 1) of size 25 is running on: n010
slave2 (rank 2 , comm 1) of size 25 is running on: n010
slave3 (rank 3 , comm 1) of size 25 is running on: n010
... ... ...
slave23 (rank 23, comm 1) of size 25 is running on: n001
slave24 (rank 24, comm 1) of size 25 is running on: n010
$slave1
[1] "n010"
$slave2
[1] "n010"
$slave3
[1] "n010"
$slave4
[1] "n010"
$slave5
[1] "n010"
$slave6
[1] "n010"
$slave7
[1] "n010"
$slave8
[1] "n010"
$slave9
[1] "n010"
$slave10
[1] "n010"
$slave11
[1] "n010"
$slave12
[1] "n001"
$slave13
[1] "n001"
$slave14
[1] "n001"
$slave15
[1] "n001"
$slave16
[1] "n001"
$slave17
[1] "n001"
$slave18
[1] "n001"
$slave19
[1] "n001"
$slave20
[1] "n001"
$slave21
[1] "n001"
$slave22
[1] "n001"
$slave23
[1] "n001"
$slave24
[1] "n010"
[1] 1
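
For anyone checking their own allocation before launching R, here is a minimal sketch (not from the thread) that tallies how many slots each host contributes to the Torque node file. The function name count_slots is hypothetical, and it assumes the node file lists one hostname per line, repeated once per slot.

```shell
#!/bin/sh
# Hypothetical helper: count how many slots each host contributes to a
# Torque node file (assumed format: one hostname per line, repeated per slot).
count_slots() {
    # $1: path to the node file, e.g. "$PBS_NODEFILE" inside a Torque job
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 }'
}

# Only report when a node file is actually available (i.e. inside a job).
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    count_slots "$PBS_NODEFILE"
fi
```

If only one host shows up here, the problem lies in the allocation itself rather than in how snow or Rmpi spawns the workers.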
On 1/26/2012 11:34 AM, Paul Johnson wrote:
> On Fri, Jan 20, 2012 at 3:53 PM, Jeff Allen<lists at jdadesign.net> wrote:
>> I have been able to successfully set up snow (0.3-5) and Rmpi (0.5-9) on my
>> RedHat 5 cluster, and have it working perfectly for jobs that don't span
>> multiple nodes.
>>
>> We're using Torque for resource management, so I start a job with access to
>> multiple nodes and load snow. Unfortunately, no matter what size cluster I
>> try to make, all of the workers end up running on the same host, leaving
>> the other hosts idle.
> Have you solved the problem yet? If not, I can help. I have exactly
> your setup and I have been through EXACTLY the same problems you are
> seeing.
>
> I've been developing a collection of Rmpi programs that actually work,
> some with Snow, some with parallel.
>
> This is the cluster main page
>
> http://web.ku.edu/~quant/cgi-bin/mw1/index.php?title=Cluster:Main
>
> and about 2/3 down, you see a link to my collection of working programs.
>
> That is an SVN repo that has http access
>
> http://winstat.quant.ku.edu/svn/hpcexample/trunk
>
> In case you are impatient, here is what I suggest. This should be
> your submission script. I mean this works for us.
>
> #!/bin/sh
> #
> #This is an example script example.sh
> #
> #These commands set up the Grid Environment for your job:
> #PBS -N SnowHelloWorld
> #PBS -l nodes=11:ppn=1
> #PBS -l walltime=00:50:00
> #PBS -M pauljohn at ku.edu
> #PBS -m bea
>
> cd $PBS_O_WORKDIR
>
> ### This RUNS, and because I give it a machine list, it uses them.
> orterun --hostfile $PBS_NODEFILE -n 1 R --no-save --vanilla -f snow-hello.R
>
> ###############################
>
> Note that in the orterun command (same as mpirun) I am ONLY REQUESTING
> one process (-n 1). We let R do the spawning of the workers. The PBS
> directive asks for 11 nodes.
>
> Then snow-hello.R creates the cluster.
>
> Why am I pasting this in? I'm crazy. Just go look here for the submission
> script, the program, an explanation, and example output.
>
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex60-HelloWorldSnow/
>
>> I'm no expert with MPI or snow, so I'm really not sure how to approach
>> debugging this.
>>
>> Any input would be much appreciated!
>>
>> Jeff
>>