[R-sig-hpc] Cluster R "environment" trouble. Using Rmpi

Paul Johnson pauljohn32 at gmail.com
Tue Sep 7 04:27:29 CEST 2010


Dear Professor:

Thanks very much for the feedback.  That has helped me to cut errors
and get one of the user programs working.  One program is still
causing trouble, and I have a couple of questions below. I am sorry
that these are so elementary, but if I can understand this, then I can
write up some clear working examples for everybody.

On Wed, Aug 18, 2010 at 1:05 PM, Hao Yu <hyu at stats.uwo.ca> wrote:
> Hi Paul,
>
> Just got back from two conferences.
>
> First of all, when R slaves are spawned, they are "naked", meaning they
> are started with only the basic R functions/libs, even though they are in
> the same dir as the master. You have to tell slaves to get all necessary
> objects or to load libraries specifically. There are a few ways to do so.
>
> Use mpi.bcast.Robj2slave(an Robj) to send "an Robj" from master to all
> slaves. If a function to be executed on slaves depends on many
> functions/data, those functions/data must be sent to slaves first.
>
> Use mpi.bcast.cmd(cmd()) to tell slaves to run cmd() like
> source("SimJob.R") (make sure to remove any execution commands in
> SimJob.R).

Can you please explain what "execution commands" means here?

I am *guessing* that anything that is forbidden in a "sourced" file is
also forbidden in a function that is passed to a node. Right?

Should it work to hide the source command inside a function, as in:

myThing <- function (i = 0){

     source("someGreatCode.R")
     ## blah  blah
}

mpi.bcast.Robj2slave( myThing )
mpi.bcast.cmd( myThing )

Is that supposed to work?
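
To make the question concrete, here is a minimal sketch of the pattern, offered as an illustration and not as a tested recipe. The file is a throwaway stand-in written to tempdir(), and helperFromFile and runOnSlaves are made-up names. It relies on the fact that source() evaluates a file in the global environment by default (local = FALSE), and it uses the call form myThing(), following the "cmd()" wording in the advice quoted above. The cluster part assumes Rmpi is installed and that the file is readable from every node; on a real cluster the file would need to sit on the shared NFS home, not in tempdir().

```r
# Throwaway stand-in for someGreatCode.R. It only defines a helper and
# starts no work of its own when it is sourced.
codeFile <- file.path(tempdir(), "someGreatCode.R")
writeLines("helperFromFile <- function(i) i^2", codeFile)

myThing <- function(i = 0, file = codeFile) {
    source(file)            # default local = FALSE: helper lands in the global env
    helperFromFile(i)
}

localResult <- myThing(3)   # runs in a plain R session, no Rmpi needed

# Cluster part, wrapped in a function so nothing is spawned until it is called.
# Assumption: Rmpi is installed and codeFile is visible on every node.
runOnSlaves <- function(nslaves = 2) {
    library(Rmpi)
    mpi.spawn.Rslaves(nslaves = nslaves)
    mpi.bcast.Robj2slave(codeFile)      # myThing's default argument needs this object
    mpi.bcast.Robj2slave(myThing)       # ship the function definition
    out <- mpi.remote.exec(myThing(3))  # call it on each slave and collect the results
    mpi.close.Rslaves()
    out
}
```

Calling runOnSlaves() on the cluster and comparing its output with localResult would show whether the sourced helper really is available on the slaves.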

You mention a "race problem."  Can I ask how I could tell if I have
that problem?

Here's why I ask. One user has a job that uses source() and it runs
very slowly.  It may take 1000 times as long as it would if the separate
parts were sent without Rmpi. I can't understand why it works at all,
frankly.  Unfortunately, his code is private and I can't share it with
you, but I'm trying to build a test case to reproduce the problem.  If
you tell me how to diagnose the race question, then perhaps I can
figure it out.

In the other cases I've tested, my mistakes cause crashes and there is
an error message from a node saying that a function is not defined, so
I know I have to send it to the nodes.  If I can get a program to
crash, I can usually fix it.  It is the ones that run forever, or
almost forever, that I can't fix.
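
One rough way to look for this kind of slowdown is sketched below, under stated assumptions. It does not prove or rule out a race condition; it only compares how long source() takes on the master with how long it takes on each slave, and reports each slave's working directory and whether it can see the file at all. timeSource and checkSlaves are made-up names, srcPath is a hypothetical path to the sourced file, and the needlog argument is used as I understand it from the mpi.spawn.Rslaves help page (it asks for slave output to be kept in log files).

```r
# Elapsed seconds needed to source one file; plain R, usable anywhere.
timeSource <- function(path) system.time(source(path))[["elapsed"]]

# Assumption: Rmpi is installed and this is run on the master.
checkSlaves <- function(srcPath, nslaves = 2) {
    library(Rmpi)
    mpi.spawn.Rslaves(nslaves = nslaves, needlog = TRUE)
    mpi.bcast.Robj2slave(timeSource)    # slaves need the timing helper
    mpi.bcast.Robj2slave(srcPath)       # and the path to test
    report <- list(
        wd         = mpi.remote.exec(getwd()),               # where each slave is running
        fileSeen   = mpi.remote.exec(file.exists(srcPath)),  # can each slave see the file?
        masterSecs = timeSource(srcPath),
        slaveSecs  = mpi.remote.exec(timeSource(srcPath))
    )
    mpi.close.Rslaves()
    report
}
```

If slaveSecs comes out far larger than masterSecs, the time is going into reading or evaluating the file on the nodes, which at least narrows the search.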

pj

> I don't know if a race condition will be an issue since slaves
> are competing for the same file.
>
> mpi.scatter.Robj/mpi.gather.Robj can also be used to send/receive objects
> among master and slaves.
>
> Hao
>
>
> Paul Johnson wrote:
>> Hi, everybody.
>>
>> A user came in with a problem on our Rocks Linux Cluster. His function
>> runs fine in an interactive session, but when he sends the function to
>> compute nodes with Rmpi, they never return.  I'd not seen that before.
>>  We are sending out a few big tasks to a few nodes.
>>
>> So I took his code, which is hundreds of lines long, spread across 4
>> files, and I've been staring at it for hours.  It makes me wonder ...
>>
>> Question 1. How do auxiliary functions find their way onto compute nodes?
>>
>> On the master, this sends "SimJob" to the compute nodes. SimJob is
>> inside "SimJob.R", as is "pars".  But if SimJob calls other functions,
>> how does the compute node find them?
>>
>> ############################################
>> library(Rmpi)
>> mpi.spawn.Rslaves(nslaves=4)
>>
>> source("SimJob.R")
>> pars
>>
>> ExitStatus <- mpi.parApply(pars, MARGIN=1, fun=SimJob)
>> cat("\n",table(ExitStatus),"\n")
>>
>> mpi.close.Rslaves()
>> mpi.quit()
>> ############################################
>>
>> SimJob.R does lots of things: it creates the object "pars" and
>> many other functions and definitions.
>>
>>  "SimJob.R" has some interlinked functions like this:
>>
>> pre1 <- function(i)   {  whatever; source("someFile.R") }
>>
>> pre2 <- function (j, something) {  whatever(something);
>> source("someOtherFile.R") }
>>
>> pre3 <- function(i) { whatever }
>>
>> SimJob <- function(x,i, j){
>>     result1 <-  pre1(i)
>>     result2 <- pre2(j, result1)
>>     result3 <- someRFunction(result1, result2)
>> }
>>
>> someRFunction is in an R package, say "lm" or something like that.
>>
>> How does a compute node get functions "pre1" and "pre2" and the files
>> they source?
>>
>> What if the implementation of pre2 calls some function pre3?
>>
>> We ARE on an NFS system with home folder available on all compute
>> nodes.  But the compute nodes don't inherit the working directory of
>> the master, do they?
>>
>> Here's the frustrating part. I can run interactively on the master
>>
>>> SimJob( pars[1, ] )
>>
>> But the whole job won't run on the compute nodes.
>>
>> 2. Suppose a function that we send to a node tries to write a result.
>> It has save(whatever, file="blha.Rda") in it.  Where does that file
>> go?  What is the "current working directory" on the compute node?
>>
>> I think that we have to re-write this so we return the information to
>> the master node and save it there.
>>
>>
>> 3. Is there a way I can find out what is going on "over there" on a
>> compute node while it is working?
>>
>> I wish I could put a bunch of print statements in so I could track the
>> thing's progress, but don't know  how to monitor them.
>>
>> When this program runs interactively, it spits out some messages to
>> StdOut.  On a compute node, where do those go?
>>
>> I've used the web program "ganglia" to see that nodes are actually
>> being used.  They are, using lots of CPU.
>>
>>
>> I've re-worked this code so that it  is all in one file (no more use
>> of source).  Still the same thing.
>>
>> I can run SimJob() interactively on the master, but it never runs on
>> the slaves.
>>
>> Well, so long, I would appreciate your ideas.
>>
>> --
>> Paul E. Johnson
>> Professor, Political Science
>> 1541 Lilac Lane, Room 504
>> University of Kansas
>>
>> _______________________________________________
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>
>
>
> --
> Department of Statistics & Actuarial Sciences
> Fax Phone#:(519)-661-3813
> The University of Western Ontario
> Office Phone#:(519)-661-3622
> London, Ontario N6A 5B7
> http://www.stats.uwo.ca/faculty/yu
>



-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas


