[R-sig-hpc] simple question on R/Rmpi/snow/slurm configuration
Whit Armstrong
armstrong.whit at gmail.com
Mon Jan 5 23:31:38 CET 2009
Thanks, Dirk.
I can run your example, but I'm confused about two things.
1) I can only get the jobs to run on node0 (the controller node), no
matter what number I use for -n or -w.
2) I don't understand how to use this example in the context of the
parLapply function. It's possible that I don't understand your
script, but to me it seems like orterun is simply sending this script
out to all the nodes to be executed. What I really want to do is load
my data into a list, then do a parLapply on the list such that each
execution of the function that is applied to the list is allocated out
to a different node.
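Concretely, the kind of call I'm after is something like this (a sketch only; 'myFun' and the data are placeholders, and it assumes 'cl' is an already-working snow cluster, e.g. from getMPIcluster()):

```r
library(snow)

## 'myFun' is a stand-in for the real per-element computation
myFun <- function(x) sum(x)

## load the data into a list
myData <- list(a = 1:10, b = 11:20)

## each element of myData should be dispatched to a different
## worker, with the results collected back into a list
res <- parLapply(cl, myData, myFun)
```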
Sorry that I need so much instruction with this.
Here is the output from running: salloc orterun -n 100 test.mpi.r
(that's your example script).
[,1]
[1,] "linuxsvr.kls.corp 2009-01-05 17:25:56.250"
[2,] "linuxsvr.kls.corp 2009-01-05 17:25:56.257"
[3,] "linuxsvr.kls.corp 2009-01-05 17:25:56.260"
[4,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
[5,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
[6,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
...
and so on. The hostname in all cases is linuxsvr (the controller node).
When I try with the -w option, the job just hangs:
[warmstrong at linuxsvr ~]$ salloc -w node[0-4] orterun -n 100 test.mpi.r
salloc: Granted job allocation 118
and the following may prove helpful:
[warmstrong at linuxsvr ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
118 prod orterun warmstro R 1:03 5 node[0-4]
[warmstrong at linuxsvr ~]$ scontrol show nodes
NodeName=node0 State=ALLOCATED CPUs=8 AllocCPUs=8 RealMemory=64000 TmpDisk=0
Sockets=2 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
Arch=x86_64 OS=Linux
NodeName=node1 State=ALLOCATED CPUs=1 AllocCPUs=1 RealMemory=2000 TmpDisk=0
Sockets=1 Cores=1 Threads=1 Weight=1 Features=(null) Reason=(null)
Arch=x86_64 OS=Linux
NodeName=node2 State=ALLOCATED CPUs=4 AllocCPUs=4 RealMemory=2000 TmpDisk=0
Sockets=1 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
Arch=x86_64 OS=Linux
NodeName=node3 State=ALLOCATED CPUs=2 AllocCPUs=2 RealMemory=2000 TmpDisk=0
Sockets=1 Cores=2 Threads=1 Weight=1 Features=(null) Reason=(null)
Arch=x86_64 OS=Linux
NodeName=node4 State=ALLOCATED CPUs=4 AllocCPUs=4 RealMemory=2000 TmpDisk=0
Sockets=1 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
Arch=x86_64 OS=Linux
[warmstrong at linuxsvr ~]$
[warmstrong at linuxsvr ~]$ scontrol show job 118
JobId=118 UserId=warmstrong(11122) GroupId=domain users(10513)
Name=orterun
Priority=4294901641 Partition=prod BatchFlag=0
AllocNode:Sid=linuxsvr:8453 TimeLimit=UNLIMITED ExitCode=0:0
JobState=RUNNING StartTime=01/05-17:27:38 EndTime=NONE
NodeList=node[0-4] NodeListIndices=0-4
AllocCPUs=8,1,4,2,4
ReqProcs=5 ReqNodes=5 ReqS:C:T=1-64.00K:1-64.00K:1-64.00K
Shared=0 Contiguous=0 CPUs/task=0 Licenses=(null)
MinProcs=1 MinSockets=1 MinCores=1 MinThreads=1
MinMemoryNode=0 MinTmpDisk=0 Features=(null)
Dependency=(null) Account=(null) Requeue=1
Reason=None Network=(null)
ReqNodeList=node[0-4] ReqNodeListIndices=0-4
ExcNodeList=(null) ExcNodeListIndices=
SubmitTime=01/05-17:27:38 SuspendTime=None PreSusTime=0
[warmstrong at linuxsvr ~]$
Thanks,
Whit
On Mon, Jan 5, 2009 at 4:40 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
>
> On 5 January 2009 at 16:04, Whit Armstrong wrote:
> | > library(Rmpi)
> | library(Rmpi)
> | [linuxsvr.kls.corp:09097] mca: base: component_find: unable to open
> | osc pt2pt: file not found (ignored)
> | > library(snow)
> | library(snow)
> | > cl <- getMPIcluster()
> | cl <- getMPIcluster()
> | > cl
>
> I don't think that works. You need to be explicit in the creation of the
> cluster. The best trick I found was in re-factoring / abstracting-out what
> snow does in its internal scripts. I showed that in the UseR talk (as opposed
> to tutorial) and picked it up in last months presentation. It goes as
> follows:
>
> -----------------------------------------------------------------------------
> #!/usr/bin/env r
>
> suppressMessages(library(Rmpi))
> suppressMessages(library(snow))
>
> #mpirank <- mpi.comm.rank(0) # just FYI
> ndsvpid <- Sys.getenv("OMPI_MCA_ns_nds_vpid")
> if (ndsvpid == "0") { # are we master ?
> #cat("Launching master (OMPI_MCA_ns_nds_vpid=", ndsvpid, " mpi rank=", mpirank, ")\n")
> makeMPIcluster()
> } else { # or are we a slave ?
> #cat("Launching slave with (OMPI_MCA_ns_nds_vpid=", ndsvpid, " mpi rank=", mpirank, ")\n")
> sink(file="/dev/null")
> slaveLoop(makeMPImaster())
> q()
> }
>
> ## a trivial main body, but note how getMPIcluster() learns from the
> ## launched cluster how many nodes are available
> cl <- getMPIcluster()
> clusterEvalQ(cl, options("digits.secs"=3)) ## use millisecond granularity
> res <- clusterCall(cl, function() paste(Sys.info()["nodename"],
>                                         format(Sys.time())))
> print(do.call(rbind, res))
> stopCluster(cl)
> -----------------------------------------------------------------------------
>
> which you can launch via salloc, as Martin suggested, to create a slurm
> allocation. Then use orterun to actually run it, i.e. have orterun call
> your script. I tend to wrap things into a littler script. I.e. something like
>
> $ salloc -w host[1-32] orterun -n 8 nameOfTheScriptAbove.r
>
> where you should then see 7 hosts (as one acts as the dispatching controller,
> so you get N-1 working out of N assigned by orterun).
>
> This has the advantage of never hard-coding how many nodes you use. It is all
> driven from the commandline. If you always have the same fixed nodes, then
> it is easier to just use the default snow cluster creation with hard-wired
> nodes.
>
> Hth, Dirk
>
>
> --
> Three out of two people have difficulties with fractions.
>