[R-sig-hpc] Rmpi spawning across nodes

Renaud Gaujoux renaud at mancala.cbio.uct.ac.za
Tue Apr 10 15:56:38 CEST 2012


Hi Ben,

I am not familiar with PBS but when you specify

-l nodes=4:ppn=1

I think you are asking for 4 CPUs spread over _up to_ 4 nodes, with _at 
least_ 1 process per node. Depending on the state of the cluster at 
submission time, the queuing system might well give you all 4 CPUs on a 
single node (which you can then be sure has at least 4 cores).
I am not sure how you enforce getting 4 separate physical machines with PBS. 
On SGE you can define a queue with a specific allocation rule 
('fill_up' if I remember correctly) that ensures you get full machines.
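For example, on a Torque-style PBS these two requests are not equivalent 
(the exact packing behaviour depends on your scheduler configuration, so 
take this as an illustration only):

#PBS -l nodes=4:ppn=1    # 4 CPUs, possibly packed onto fewer physical hosts
#PBS -l nodes=1:ppn=4    # 4 CPUs, all on a single host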

I am not sure this is actually what you want, though, since you only 
request a single CPU on each machine.
Do you really need separate machines? If you do, either because of memory 
usage or because each initial process will itself use all available CPUs 
on its node, then you might also want to add a memory specification so 
that other users' jobs don't take up your resources.
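Something like the following would do it on a Torque-style PBS (the exact 
resource names and sensible limits depend on your site, so treat the values 
as placeholders):

#PBS -l nodes=4:ppn=1
#PBS -l pmem=2gb    # per-process memory request; adjust to your job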

The strange thing here is that you say your cluster only has dual-core 
nodes, so 4 slaves should end up on at least 2 different machines (even if 
not the 4 you expect) rather than all on a single one.
Also, printing/returning the process ID together with the machine name 
could give you better information about how the computation is actually 
carried out; see the snippet below.
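A minimal sketch of what I mean, assuming 'cl' is the cluster object you 
already create with makeCluster() in your script:

# ask every slave for its host name and its OS process ID
print(clusterCall(cl, function() {
    c(Sys.info()[c("nodename", "machine")], pid = Sys.getpid())
}))

If all slaves report the same nodename but different PIDs, they really are 
separate processes that just happen to live on the same machine.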

Renaud


-- 
Renaud Gaujoux
Computational Biology - University of Cape Town
South Africa


On 10/04/2012 12:00, r-sig-hpc-request at r-project.org wrote:
> Date: Mon, 9 Apr 2012 16:55:55 -0400
> From: Stephen Weston <stephen.b.weston at gmail.com>
> To: Ben Weinstein <bweinste at life.bio.sunysb.edu>
> Cc: r-sig-hpc at r-project.org, Jan Kasiak <j.kasiak at gmail.com>
> Subject: Re: [R-sig-hpc] Rmpi spawning across nodes.
>
> Hi Ben,
>
> What machines are listed when you execute:
>
>    cat $PBS_NODEFILE
>
> in your batch script?  Is it definitely four different nodes?
>
> - Steve
>
>
> On Mon, Apr 9, 2012 at 3:28 PM, Ben Weinstein
> <bweinste at life.bio.sunysb.edu>  wrote:
>> Hi Stephen,
>>
>> I've tried to follow your answer, but I'm still getting the same results.
>> The heart of my qsub script looks like:
>>
>> mpirun -hostfile $PBS_NODEFILE -np 1 R --slave -f /nfs/user08/bw4sz/Files/Seawulf.R
>>
>>
>> Before I run the foreach statement, I ask what node I am on:
>> [1] "Original Node wulfie121"
>>
>> I make sure the Open MPI library is there:
>> [1] "/usr/local/pkg/openmpi-1.4.4/lib/"
>>
>> I make the cluster and ask how many slaves were spawned:
>> 4 slaves are spawned successfully. 0 failed.
>>
>> Then I ask for the nodename of each of my slaves. I believe that if
>> this is working correctly, each of the nodenames should be different, since
>> I specified #PBS -l nodes=4:ppn=1
>>
>> However, all the slaves still spawn on that one node.
>> [[1]]
>>    nodename     machine
>> "wulfie121"    "x86_64"
>>
>> [[2]]
>>    nodename     machine
>> "wulfie121"    "x86_64"
>>
>> [[3]]
>>    nodename     machine
>> "wulfie121"    "x86_64"
>>
>> [[4]]
>>    nodename     machine
>> "wulfie121"    "x86_64"
>>
>> Finally, I'm timing the process to see if I'm actually getting
>> parallelization.
>> [1] 4
>>    user  system elapsed
>>  17.650  39.990 159.632
>>
>> Again, the heart of the code looks like:
>>
>> cl <- makeCluster(4, type = "MPI")
>> print(clusterCall(cl, function() Sys.info()[c("nodename","machine")]))
>> registerDoSNOW(cl)
>> print(getDoParWorkers())
>> system.time(five.ten <- rbind.fill(foreach(j = 1:times) %dopar%
>>   drop.shuffle(j, iterations)))
>> stopCluster(cl)
>>
>> I am about to change over to a different parallel backend as suggested,
>> but I doubt that is the root of the problem in this case.
>>
>>
>> I appreciate the continued help,
>>
>> Ben Weinstein
>>
>> On Thu, Mar 29, 2012 at 2:56 PM, Stephen Weston<stephen.b.weston at gmail.com>
>> wrote:
>>> Hi Ben,
>>>
>>> You have to run R via mpirun, otherwise all of the workers start
>>> on the one node.
>>>
>>>> I have tried using mpirun -np 4 in front of the R call, but this just
>>>> fails without a message.
>>> You have to use '-np 1', otherwise your script will be executed
>>> by mpirun four times, each trying to spawn four workers.
>>> I'm not sure if that explains failing without a message, however.
>>>
>>> Try something like this:
>>>
>>> #!/bin/bash
>>> #PBS -o 'qsub.out'
>>> #PBS -e 'qsub.err'
>>> #PBS -l nodes=4:ppn=1
>>> #PBS -m bea
>>> cat $PBS_NODEFILE
>>> hostname
>>>
>>> cd $PBS_O_WORKDIR
>>>
>>> # Run an R script
>>> mpirun -hostfile $PBS_NODEFILE -np 1 R --slave -f /nfs/user08/bw4sz/Files/Seawulf.R
>>>
>>> You may not need to use '-hostfile $PBS_NODEFILE', depending on
>>> how your Open MPI was built, but I don't think it ever hurts, and
>>> it may be required for your installation.
>>>
>>> - Steve
>>
>>
>>
>> --
>> Ben Weinstein
>> Graduate Student
>> Ecology and Evolution
>> Stony Brook University
>>
>> http://life.bio.sunysb.edu/~bweinste/index.html
>>
>
>