[R-sig-hpc] simple question on R/Rmpi/snow/slurm configuration

Whit Armstrong armstrong.whit at gmail.com
Mon Jan 5 23:31:38 CET 2009


Thanks, Dirk.

I can run your example, but I'm confused about two things.

1) I can only get the jobs to run on node0 (the controller node), no
matter what number I use for -n or -w.

2) I don't understand how to use this example in the context of the
parLapply function.  It's possible that I don't understand your
script, but it looks to me as if orterun simply sends the script out
to all the nodes to be executed.  What I really want to do is load my
data into a list and then call parLapply on that list, so that each
application of the function to a list element is farmed out to a
different node.
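
For concreteness, here is roughly what I mean (just a sketch; my.data
and my.fun stand in for my actual list and function), dropped into the
main body of your script after getMPIcluster():

cl <- getMPIcluster()
my.data <- as.list(1:20)                  # stand-in for my real data list
my.fun  <- function(x) {
    paste(Sys.info()["nodename"], x)      # report which node handled which element
}
res <- parLapply(cl, my.data, my.fun)     # spread the list elements over the workers
print(do.call(rbind, res))
stopCluster(cl)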

Sorry that I need so much instruction with this.

Here is the output from running: salloc orterun -n 100 test.mpi.r
(that's your example script).
      [,1]
  [1,] "linuxsvr.kls.corp 2009-01-05 17:25:56.250"
  [2,] "linuxsvr.kls.corp 2009-01-05 17:25:56.257"
  [3,] "linuxsvr.kls.corp 2009-01-05 17:25:56.260"
  [4,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
  [5,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
  [6,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
...
and so on.  The hostname in all cases is linuxsvr (the controller node).

When I try w/ the -w option, the job just hangs:

[warmstrong at linuxsvr ~]$ salloc -w node[0-4] orterun -n 100 test.mpi.r
salloc: Granted job allocation 118


and the following may prove helpful:

[warmstrong at linuxsvr ~]$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    118      prod  orterun warmstro   R       1:03      5 node[0-4]

[warmstrong at linuxsvr ~]$ scontrol show nodes
NodeName=node0 State=ALLOCATED CPUs=8 AllocCPUs=8 RealMemory=64000 TmpDisk=0
   Sockets=2 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
NodeName=node1 State=ALLOCATED CPUs=1 AllocCPUs=1 RealMemory=2000 TmpDisk=0
   Sockets=1 Cores=1 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
NodeName=node2 State=ALLOCATED CPUs=4 AllocCPUs=4 RealMemory=2000 TmpDisk=0
   Sockets=1 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
NodeName=node3 State=ALLOCATED CPUs=2 AllocCPUs=2 RealMemory=2000 TmpDisk=0
   Sockets=1 Cores=2 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
NodeName=node4 State=ALLOCATED CPUs=4 AllocCPUs=4 RealMemory=2000 TmpDisk=0
   Sockets=1 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
[warmstrong at linuxsvr ~]$


[warmstrong at linuxsvr ~]$ scontrol show job 118
JobId=118 UserId=warmstrong(11122) GroupId=domain users(10513)
   Name=orterun
   Priority=4294901641 Partition=prod BatchFlag=0
   AllocNode:Sid=linuxsvr:8453 TimeLimit=UNLIMITED ExitCode=0:0
   JobState=RUNNING StartTime=01/05-17:27:38 EndTime=NONE
   NodeList=node[0-4] NodeListIndices=0-4
   AllocCPUs=8,1,4,2,4
   ReqProcs=5 ReqNodes=5 ReqS:C:T=1-64.00K:1-64.00K:1-64.00K
   Shared=0 Contiguous=0 CPUs/task=0 Licenses=(null)
   MinProcs=1 MinSockets=1 MinCores=1 MinThreads=1
   MinMemoryNode=0 MinTmpDisk=0 Features=(null)
   Dependency=(null) Account=(null) Requeue=1
   Reason=None Network=(null)
   ReqNodeList=node[0-4] ReqNodeListIndices=0-4
   ExcNodeList=(null) ExcNodeListIndices=
   SubmitTime=01/05-17:27:38 SuspendTime=None PreSusTime=0

[warmstrong at linuxsvr ~]$


Thanks,
Whit



On Mon, Jan 5, 2009 at 4:40 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
>
> On 5 January 2009 at 16:04, Whit Armstrong wrote:
> | > library(Rmpi)
> | library(Rmpi)
> | [linuxsvr.kls.corp:09097] mca: base: component_find: unable to open
> | osc pt2pt: file not found (ignored)
> | > library(snow)
> | library(snow)
> | >  cl <- getMPIcluster()
> |  cl <- getMPIcluster()
> | > cl
>
> I don't think that works.  You need to be explicit in the creation of the
> cluster.  The best trick I found was in re-factoring / abstracting-out what
> snow does in its internal scripts. I showed that in the UseR talk (as opposed
> to tutorial) and picked it up in last months presentation. It goes as
> follows:
>
> -----------------------------------------------------------------------------
> #!/usr/bin/env r
>
> suppressMessages(library(Rmpi))
> suppressMessages(library(snow))
>
> #mpirank <- mpi.comm.rank(0)    # just FYI
> ndsvpid <- Sys.getenv("OMPI_MCA_ns_nds_vpid")
> if (ndsvpid == "0") {                   # are we master ?
>    #cat("Launching master (OMPI_MCA_ns_nds_vpid=", ndsvpid, " mpi rank=",     mpirank, ")\n")
>    makeMPIcluster()
> } else {                                # or are we a slave ?
>    #cat("Launching slave with (OMPI_MCA_ns_nds_vpid=", ndsvpid, " mpi rank=", mpirank, ")\n")
>    sink(file="/dev/null")
>    slaveLoop(makeMPImaster())
>    q()
> }
>
> ## a trivial main body, but note how getMPIcluster() learns from the
> ## launched cluster how many nodes are available
> cl <- getMPIcluster()
> clusterEvalQ(cl, options("digits.secs"=3))      ## use millisecond granularity
> res <- clusterCall(cl, function() paste(Sys.info()["nodename"],
>                                         format(Sys.time())))
> print(do.call(rbind,res))
> stopCluster(cl)
> -----------------------------------------------------------------------------
>
> which you can launch via salloc, as Martin suggested, to create a slurm
> allocation. Then use orterun to actually use it, i.e. have orterun call your
> script. I tend to wrap things into a littler script.  E.g. something like
>
>      $ salloc -w host[1-32] orterun -n 8 nameOfTheScriptAbove.r
>
> where you should then see 7 hosts (as one acts as the dispatching controller,
> so you get N-1 working out of N assigned by orterun).
>
> This has the advantage of never hard-coding how many nodes you use. It is all
> driven from the commandline.  If you always have the same fixed nodes, then
> it is easier to just use the default snow cluster creation with hard-wired
> nodes.
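>
> For the hard-wired case, a minimal sketch could look like this (the host
> names are placeholders, and it uses snow's socket transport rather than MPI):
>
> library(snow)
> cl <- makeSOCKcluster(c("node1", "node2", "node3", "node4"))  # fixed set of hosts
> res <- clusterCall(cl, function() Sys.info()["nodename"])     # which host ran each call?
> print(do.call(rbind, res))
> stopCluster(cl)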
>
> Hth,  Dirk
>
>
> --
> Three out of two people have difficulties with fractions.
>


