[R-sig-hpc] Cluster R "environment" trouble. Using Rmpi

Hao Yu hyu at stats.uwo.ca
Wed Aug 18 20:05:10 CEST 2010


Hi Paul,

Just got back from two conferences.

First of all, when R slaves are spawned, they are "naked": they start
with only the basic R functions and libraries, even though they are in the
same directory as the master. You have to tell the slaves explicitly to
fetch all necessary objects and to load any libraries they need. There are
a few ways to do so.

Use mpi.bcast.Robj2slave(an Robj) to send "an Robj" from master to all
slaves. If a function to be executed on slaves depends on many
functions/data, those functions/data must be sent to slaves first.
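
A minimal sketch of this first approach, using the function names from
Paul's message (pre1, pre2, pre3, SimJob). The wrapper prepare_slaves() is
a hypothetical helper, not part of Rmpi, and the sketch assumes SimJob.R
contains definitions only:

```r
# Hypothetical helper: call it from the master R session.
# Assumes Rmpi is installed and that SimJob.R defines pre1, pre2, pre3
# and SimJob without running anything.
prepare_slaves <- function(nslaves = 4) {
  library(Rmpi)
  mpi.spawn.Rslaves(nslaves = nslaves)

  # source() evaluates in the global environment of the master only.
  source("SimJob.R")

  # Send every function that SimJob depends on, then SimJob itself.
  mpi.bcast.Robj2slave(pre1)
  mpi.bcast.Robj2slave(pre2)
  mpi.bcast.Robj2slave(pre3)
  mpi.bcast.Robj2slave(SimJob)

  # Packages are not sent this way; load them on the slaves, e.g.
  #   mpi.bcast.cmd(library(MASS))

  # Ask each slave what it now has in its workspace.
  mpi.remote.exec(ls())
}
```

Calling prepare_slaves() before mpi.parApply() should show the four names
in the listing returned for each slave; anything missing there is
something the slaves cannot see.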

Use mpi.bcast.cmd(cmd()) to tell the slaves to run cmd(), for example
source("SimJob.R") (make sure to remove any execution commands from
SimJob.R first). I don't know whether a race condition will be an issue,
since the slaves are all competing for the same file.
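
A rough sketch of the source() approach, assuming the slaves are already
spawned and can reach the master's directory (for example over NFS).
load_on_slaves() and master_dir are hypothetical names used only for
illustration:

```r
# Hypothetical helper: call it from the master after mpi.spawn.Rslaves().
# Assumes SimJob.R holds definitions only and is visible from every node.
load_on_slaves <- function() {
  library(Rmpi)

  # Slaves need not share the master's working directory, so send it
  # over as an object and switch to it on every slave.
  master_dir <- getwd()
  mpi.bcast.Robj2slave(master_dir)
  mpi.bcast.cmd(setwd(master_dir))

  # Each slave now reads the file itself.
  mpi.bcast.cmd(source("SimJob.R"))

  # Report the working directory of each slave as a sanity check.
  mpi.remote.exec(getwd())
}
```

Setting the directory first also makes it predictable where a save()
call inside SimJob would write its file.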

mpi.scatter.Robj/mpi.gather.Robj can also be used to send/receive objects
between master and slaves.
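
A small sketch of the scatter/gather pattern, assuming slaves are already
spawned. split_rows() and scatter_gather_demo() are hypothetical helpers;
the list passed to mpi.scatter.Robj needs one element per member of the
communicator, master included:

```r
# Split the rows of a matrix or data frame into n contiguous chunks.
# Plain R, no MPI needed.
split_rows <- function(x, n) {
  idx <- sort(rep(seq_len(n), length.out = nrow(x)))
  lapply(seq_len(n), function(k) x[idx == k, , drop = FALSE])
}

# Hypothetical helper: call it from the master after mpi.spawn.Rslaves().
scatter_gather_demo <- function(pars) {
  library(Rmpi)
  n <- mpi.comm.size()            # master plus slaves
  chunks <- split_rows(pars, n)   # first chunk stays on the master

  # Slaves must enter the scatter call before the master does.
  mpi.bcast.cmd(chunk <- mpi.scatter.Robj())
  chunk <- mpi.scatter.Robj(obj = chunks)

  # Every member reports how many rows it received.
  mpi.bcast.cmd(mpi.gather.Robj(nrow(chunk)))
  mpi.gather.Robj(nrow(chunk))
}
```

The same gather step could bring back real results instead of row counts,
which lets the master do all the saving.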

Hao


Paul Johnson wrote:
> Hi, everybody.
>
> A user came in with a problem on our Rocks Linux Cluster. His function
> runs fine in an interactive session, but when he sends the function to
> compute nodes with Rmpi, they never return.  I'd not seen that before.
>  We are sending out a few big tasks to a few nodes.
>
> So I took his code, which is hundreds of lines long, spread across 4
> files, and I've been staring at it for hours.  It makes me wonder ...
>
> Question 1. How do auxiliary functions find their way onto compute nodes?
>
> On the master, this sends "SimJob" to the compute nodes. SimJob is
> inside "SimJob.R", as is "pars".  But if SimJob calls other functions,
> how does the compute node find them?
>
> ############################################
> library(Rmpi)
> mpi.spawn.Rslaves(nslaves=4)
>
> source("SimJob.R")
> pars
>
> ExitStatus <- mpi.parApply(pars, MARGIN=1, fun=SimJob)
> cat("\n",table(ExitStatus),"\n")
>
> mpi.close.Rslaves()
> mpi.quit()
> ############################################
>
> The SimJob.R does lots of things, it creates the object "pars" and
> many other functions and definitions.
>
>  "SimJob.R" has some interlinked functions like this:
>
> pre1 <- function(i)   {  whatever; source("someFile.R") }
>
> pre2 <- function (j, something) {  whatever(something);
> source("someOtherFile.R") }
>
> pre3 <- function(i) { whatever }
>
> SimJob <- function(x,i, j){
>     result1 <-  pre(i)
>     result2 <- pre2(j, result1)
>     result3 <- someRFunction(result1, result2)
> }
>
> someRFunction is in an R package, say "lm" or something like that.
>
> How does a compute node  get functions "pre" and "pre2" and the files
> they source?
>
> What if the implementation of pre2 calls some function pre3?
>
> We ARE on an NFS system with home folder available on all compute
> nodes.  But the compute nodes don't inherit the working directory of
> the master, do they?
>
> Here's the frustrating part. I can run interactively on the master
>
>> SimJob( pars[1, ] )
>
> But the whole job won't run on the compute nodes.
>
> 2. Suppose a function that we send to a node tries to write a result.
> It has "save(whatever,file="blha.Rda")  in it.   Where does that file
> go?  What is the "current working directory" on the compute node?
>
> I think that we have to re-write this so we return the information to
> the master node and save it there.
>
>
> 3. Is there a way I can find out what is going on "over there" on a
> compute node while it is working?
>
> I wish I could put a bunch of print statements in so I could track the
> thing's progress, but don't know  how to monitor them.
>
> When this program runs interactively, it spits out some messages to
> StdOut.  On a compute node, where do those go?
>
> I've used the web program "ganglia" to see that nodes are actually
> being used.  They are, using lots of CPU.
>
>
> I've re-worked this code so that it  is all in one file (no more use
> of source).  Still the same thing.
>
> I can run SimJob () on the interactively,  but it never runs on the
> slaves.
>
> Well, so long, I would appreciate your ideas.
>
> --
> Paul E. Johnson
> Professor, Political Science
> 1541 Lilac Lane, Room 504
> University of Kansas
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>


-- 
Department of Statistics & Actuarial Sciences
Fax Phone#:(519)-661-3813
The University of Western Ontario
Office Phone#:(519)-661-3622
London, Ontario N6A 5B7
http://www.stats.uwo.ca/faculty/yu


