[R-sig-hpc] Snow Not Distributing
lists at jdadesign.net
Thu Jan 26 20:05:49 CET 2012
THANK YOU! I followed your summary and was able to produce the output below.
I updated the reference to the hostfile, as you suggested, ensuring that
-n was set to only 1.
I really can't thank you enough for posting this -- saved me a TON of time.
[jalle6 at qbri ecoli]$ qsub -I -l nodes=2:ppn=12
qsub: waiting for job 6839.qbri to start
qsub: job 6839.qbri ready
[jalle6 at n010 ~]$ ./Temp/shmpi.sh
24 slaves are spawned successfully. 0 failed.
master (rank 0 , comm 1) of size 25 is running on: n010
slave1 (rank 1 , comm 1) of size 25 is running on: n010
slave2 (rank 2 , comm 1) of size 25 is running on: n010
slave3 (rank 3 , comm 1) of size 25 is running on: n010
... ... ...
slave23 (rank 23, comm 1) of size 25 is running on: n001
slave24 (rank 24, comm 1) of size 25 is running on: n010
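
For anyone checking a similar run, here is a minimal sketch (not part of the original scripts) that tallies how many ranks landed on each host. It assumes the placement lines look exactly like the ones above; the function name count_ranks_per_host is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count MPI ranks per host from placement lines such as
#   slave1 (rank 1 , comm 1) of size 25 is running on: n010
# read on standard input. Lines without "is running on:" are ignored.
count_ranks_per_host() {
    # Extract the hostname after "running on: ", then count each distinct
    # host and print it as "host count".
    sed -n 's/.*is running on: *\([^ ]*\).*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example use (assumes the script prints the placement lines to stdout):
#   ./Temp/shmpi.sh | count_ranks_per_host
```

If every rank is reported on a single host, the workers are not being distributed; if several hosts appear, the spawn is spanning nodes.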
On 1/26/2012 11:34 AM, Paul Johnson wrote:
> On Fri, Jan 20, 2012 at 3:53 PM, Jeff Allen<lists at jdadesign.net> wrote:
>> I have been able to successfully set up snow (0.3-5) and Rmpi (0.5-9) on my
>> RedHat 5 cluster, and have it working perfectly for jobs that don't span
>> multiple nodes.
>> We're using Torque for resource management, so I start a job with access to
>> multiple nodes and load Snow. Unfortunately, no matter what size cluster I
>> try to make, all of the workers end up running on the same host -- leaving
>> the other hosts idle.
> Have you solved the problem yet? If not, I can help. I have exactly
> your setup and I have been through EXACTLY the same problems you are having.
> I've been developing a collection of Rmpi programs that actually work,
> some with Snow, some with parallel.
> This is the cluster main page
> and about 2/3 down, you see a link to my collection of working programs.
> That is an SVN repo that has http access
> In case you are impatient, here is what I suggest. This should be
> your submission script. I mean this works for us.
> #This is an example script example.sh
> #These commands set up the Grid Environment for your job:
> #PBS -N SnowHelloWorld
> #PBS -l nodes=11:ppn=1
> #PBS -l walltime=00:50:00
> #PBS -M pauljohn at ku.edu
> #PBS -m bea
> cd $PBS_O_WORKDIR
> ### This RUNS, and because I give it a machine list, it uses them.
> orterun --hostfile $PBS_NODEFILE -n 1 R --no-save --vanilla -f snow-hello.R
> Note that in the orterun command (same as mpirun) I am ONLY REQUESTING
> one process. We let R do the spawning of the jobs. The PBS command asks
> for 11 nodes.
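
Since the hostfile passed to orterun is what lets the spawned workers reach the other nodes, it can help to look at what Torque actually granted before launching R. The following is a rough sketch under that assumption, not something from the original scripts; the function name summarise_nodefile is invented here. It relies only on the fact that $PBS_NODEFILE lists one hostname per line for each allocated slot.

```shell
#!/bin/sh
# Hypothetical helper: print "host slots" for each host in a Torque node file.
# Uses the first argument as the file, falling back to $PBS_NODEFILE.
summarise_nodefile() {
    nodefile=${1:-$PBS_NODEFILE}
    if [ -z "$nodefile" ] || [ ! -r "$nodefile" ]; then
        echo "no readable node file (is this running inside a PBS job?)" >&2
        return 1
    fi
    # One line per slot, so counting duplicate hostnames gives slots per host.
    sort "$nodefile" | uniq -c | awk '{ print $2, $1 }'
}

# Example use inside a job script, before the orterun line:
#   summarise_nodefile || exit 1
```

With a request like nodes=11:ppn=1 the file would normally name 11 hosts once each; with nodes=2:ppn=12 it would normally name two hosts 12 times each. If only one host shows up here, the problem lies with the allocation rather than with snow or Rmpi.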
> Then the job for snow-hello.R creates the cluster.
> Why am I pasting this in? I'm crazy. Just go look here for the sub
> script, the program, an explanation, and example output.
>> I'm no expert with MPI or snow, so I'm really not sure how to approach
>> debugging this.
>> Any input would be much appreciated!