[R-sig-hpc] Cluster R "environment" trouble. Using Rmpi

Hao Yu hyu at stats.uwo.ca
Tue Sep 7 16:41:42 CEST 2010


> mpi.bcast.Robj2slave( myThing )

will send the R object myThing to all slaves.

> mpi.bcast.cmd( myThing )

will tell all slaves to evaluate myThing. But that just prints the function
myThing itself on each slave. You need

mpi.bcast.cmd( myThing() )

to execute the function myThing.
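
For example, here is a minimal sketch of the difference (myThing is just a
placeholder function defined on the master; the exact function does not
matter):

library(Rmpi)
mpi.spawn.Rslaves(nslaves = 2)

## placeholder function, defined on the master only
myThing <- function() Sys.info()[["nodename"]]

mpi.bcast.Robj2slave(myThing)    # ship the function object to every slave
mpi.bcast.cmd(myThing)           # evaluates the bare symbol: each slave just echoes the body
mpi.bcast.cmd(myThing())         # actually calls the function on each slave

## mpi.remote.exec() runs an expression on the slaves and returns the results
mpi.remote.exec(myThing())

mpi.close.Rslaves()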

Race condition: when source("someGreatCode.R") is executed on the slaves, they
all try to read "someGreatCode.R" at the same time, but the file can only be
accessed by one process at a time. If "someGreatCode.R" contains execution
commands, file access may stay locked to a particular slave until that slave
finishes, which prevents the other slaves from accessing "someGreatCode.R"
until the lock is released.

If "someGreatCode.R" load from slaves are slow, it might be better to load
it on master and use mpi.bcastRobj2slave to move all necessary objects to
slaves. Make sure "someGreatCode.R" doesn't contain any lengthy execution
commands.
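
As a sketch of that pattern, reusing the (hypothetical) names from the post
quoted below, and assuming "SimJob.R" only defines pars, pre1, pre2, pre3 and
SimJob without running anything at the top level:

library(Rmpi)
mpi.spawn.Rslaves(nslaves = 4)

source("SimJob.R")               # read the file once, on the master only

## push everything SimJob needs onto the slaves
mpi.bcast.Robj2slave(pars)
mpi.bcast.Robj2slave(pre1)
mpi.bcast.Robj2slave(pre2)
mpi.bcast.Robj2slave(pre3)
mpi.bcast.Robj2slave(SimJob)

## the slaves now hold local copies, so no file access happens during the run
ExitStatus <- mpi.parApply(pars, 1, SimJob)

mpi.close.Rslaves()
mpi.quit()

If SimJob also needs an add-on package, tell the slaves to load it with
mpi.bcast.cmd(library(somePackage)).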


Hao

Paul Johnson wrote:
> Dear Professor:
>
> Thanks very much for the feedback.  That has helped me to cut errors
> and get one of the user programs working.  One program is still
> causing trouble, and I have a couple of questions below. I am sorry
> that these are so elementary, but if I can understand this, then I can
> write up some clear working examples for everybody.
>
> On Wed, Aug 18, 2010 at 1:05 PM, Hao Yu <hyu at stats.uwo.ca> wrote:
>> Hi Paul,
>>
>> Just got back from two conferences.
>>
>> First of all, when R slaves are spawned, they are "naked", meaning they
>> start with only the base R functions/libraries, even if they are in the
>> same directory as the master. You have to tell the slaves explicitly to
>> get all necessary objects or to load libraries.
>>
>> Use mpi.bcast.Robj2slave(an Robj) to send "an Robj" from master to all
>> slaves. If a function to be executed on slaves depends on many
>> functions/data, those functions/data must be sent to slaves first.
>>
>> Use mpi.bcast.cmd(cmd()) to tell slaves to run cmd(), like
>> source("SimJob.R") (make sure to remove any execution commands in
>> SimJob.R).
>
> Can you please explain what "execution commands" means here?
>
> I am *guessing* that anything that is forbidden in a "sourced" file is
> also forbidden in a function that is passed to a node. Right?
>
> Should it work to hide the source command inside a function, as in:
>
> myThing <- function (i = 0){
>
>      source("someGreatCode.R")
>      ## blah  blah
> }
>
> mpi.bcast.Robj2slave( myThing )
> mpi.bcast.cmd( myThing )
>
> Is that supposed to work?
>
> You mention a "race problem."  Can I ask how I could tell if I have
> that problem?
>
> Here's why I ask. One user has a job that uses source() and it runs
> very slowly.  It may take 1000 x as long as it would if the separate
> parts were sent without Rmpi. I can't understand why it works at all,
> frankly.  Unfortunately, his code is private and I can't share it with
> you, but I'm trying to build a test case to reproduce the problem.  If
> you tell me how to diagnose the race question, then perhaps I can
> figure it out.
>
> In the other cases I've tested, my mistakes cause crashes and there is
> an error message from a node saying that a function is not defined, so
> I know I have to send it to the nodes.  If I can get a program to
> crash, I can usually fix it.  It is the ones that run forever, or
> almost forever, that I can't fix.
>
> pj
>
>> I don't know if a race condition will be an issue, since the slaves
>> are competing for the same file.
>>
>> mpi.scatter.Robj/mpi.gather.Robj can also be used to send/receive
>> objects between the master and the slaves.
>>
>> Hao
>>
>>
>> Paul Johnson wrote:
>>> Hi, everybody.
>>>
>>> A user came in with a problem on our Rocks Linux Cluster. His function
>>> runs fine in an interactive session, but when he sends the function to
>>> compute nodes with Rmpi, the jobs never return.  I'd not seen that before.
>>>  We are sending out a few big tasks to a few nodes.
>>>
>>> So I took his code, which is hundreds of lines long, spread across 4
>>> files, and I've been staring at it for hours.  It makes me wonder ...
>>>
>>> Question 1. How do auxiliary functions find their way onto compute
>>> nodes?
>>>
>>> On the master, this sends "SimJob" to the compute nodes. SimJob is
>>> inside "SimJob.R", as is "pars".  But if SimJob calls other functions,
>>> how does the compute node find them?
>>>
>>> ############################################
>>> library(Rmpi)
>>> mpi.spawn.Rslaves(nslaves=4)
>>>
>>> source("SimJob.R")
>>> pars
>>>
>>> ExitStatus <- mpi.parApply(pars, MARGIN=1, fun=SimJob)
>>> cat("\n",table(ExitStatus),"\n")
>>>
>>> mpi.close.Rslaves()
>>> mpi.quit()
>>> ############################################
>>>
>>> SimJob.R does lots of things: it creates the object "pars" along with
>>> many other functions and definitions.
>>>
>>>  "SimJob.R" has some interlinked functions like this:
>>>
>>> pre1 <- function(i)   {  whatever; source("someFile.R") }
>>>
>>> pre2 <- function (j, something) {  whatever(something);
>>> source("someOtherFile.R") }
>>>
>>> pre3 <- function(i) { whatever }
>>>
>>> SimJob <- function(x,i, j){
>>>     result1 <- pre1(i)
>>>     result2 <- pre2(j, result1)
>>>     result3 <- someRFunction(result1, result2)
>>> }
>>>
>>> someRFunction is in an R package, say "lm" or something like that.
>>>
>>> How does a compute node get the functions "pre1" and "pre2" and the files
>>> they source?
>>>
>>> What if the implementation of pre2 calls some function pre3?
>>>
>>> We ARE on an NFS system with home folder available on all compute
>>> nodes.  But the compute nodes don't inherit the working directory of
>>> the master, do they?
>>>
>>> Here's the frustrating part. I can run interactively on the master
>>>
>>>> SimJob( pars[1, ] )
>>>
>>> But the whole job won't run on the compute nodes.
>>>
>>> 2. Suppose a function that we send to a node tries to write a result.
>>> It has save(whatever, file="blah.Rda") in it.  Where does that file
>>> go?  What is the "current working directory" on the compute node?
>>>
>>> I think that we have to re-write this so we return the information to
>>> the master node and save it there.
>>>
>>>
>>> 3. Is there a way I can find out what is going on "over there" on a
>>> compute node while it is working?
>>>
>>> I wish I could put a bunch of print statements in so I could track the
>>> thing's progress, but I don't know how to monitor their output.
>>>
>>> When this program runs interactively, it spits out some messages to
>>> StdOut.  On a compute node, where do those go?
>>>
>>> I've used the web program "ganglia" to see that nodes are actually
>>> being used.  They are, using lots of CPU.
>>>
>>>
>>> I've re-worked this code so that it is all in one file (no more use
>>> of source()).  Still the same problem.
>>>
>>> I can run SimJob() interactively on the master, but it never runs on
>>> the slaves.
>>>
>>> Well, so long, I would appreciate your ideas.
>>>
>>> --
>>> Paul E. Johnson
>>> Professor, Political Science
>>> 1541 Lilac Lane, Room 504
>>> University of Kansas
>>>
>>> _______________________________________________
>>> R-sig-hpc mailing list
>>> R-sig-hpc at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>>
>>
>>
>> --
>> Department of Statistics & Actuarial Sciences
>> The University of Western Ontario
>> London, Ontario N6A 5B7
>> Office Phone#: (519) 661-3622
>> Fax Phone#: (519) 661-3813
>> http://www.stats.uwo.ca/faculty/yu
>>
>
>
>
> --
> Paul E. Johnson
> Professor, Political Science
> 1541 Lilac Lane, Room 504
> University of Kansas
>


-- 
Department of Statistics & Actuarial Sciences
The University of Western Ontario
London, Ontario N6A 5B7
Office Phone#: (519) 661-3622
Fax Phone#: (519) 661-3813
http://www.stats.uwo.ca/faculty/yu


