[R-sig-hpc] Cluster R "environment" trouble. Using Rmpi
Paul Johnson
pauljohn32 at gmail.com
Tue Sep 7 04:27:29 CEST 2010
Dear Professor:
Thanks very much for the feedback. That has helped me to cut errors
and get one of the user programs working. One program is still
causing trouble, and I have a couple of questions below. I am sorry
that these are so elementary, but if I can understand this, then I can
write up some clear working examples for everybody.
On Wed, Aug 18, 2010 at 1:05 PM, Hao Yu <hyu at stats.uwo.ca> wrote:
> Hi Paul,
>
> Just got back from two conferences.
>
> First of all, when R slaves are spawned, they are "naked", meaning they
> are started with only the basic R functions/libraries, even though they
> are in the same dir as the master. You have to tell the slaves to get all
> necessary objects or to load libraries explicitly. There are a few ways
> to do so.
>
> Use mpi.bcast.Robj2slave(an Robj) to send "an Robj" from master to all
> slaves. If a function to be executed on slaves depends on many
> functions/data, those functions/data must be sent to slaves first.
>
> Use mpi.bcast.cmd(cmd()) to tell slaves to run cmd(), for example
> source("SimJob.R") (make sure to remove any execution commands in
> SimJob.R).
Can you please explain what "execution commands" means here?
I am *guessing* that anything that is forbidden in a "sourced" file is
also forbidden in a function that is passed to a node. Right?
Should it work to hide the source command inside a function, as in:
myThing <- function (i = 0){
source("someGreatCode.R")
## blah blah
}
mpi.bcast.Robj2slave( myThing )
mpi.bcast.cmd( myThing )
Is that supposed to work?
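
A minimal sketch of the broadcast-then-call pattern in question, assuming
Rmpi is installed and slaves can be spawned. To the best of my understanding,
mpi.bcast.cmd() and mpi.remote.exec() evaluate the expression they are given
on each slave, so the call is written with parentheses; a bare name would
only evaluate the symbol without calling the function. The wrapper name
run_myThing_on_slaves is hypothetical, and someGreatCode.R is the placeholder
file from the example above.

```r
## Sketch only: assumes Rmpi is installed and an MPI runtime is available.
## myThing and "someGreatCode.R" are the placeholder names used above.
myThing <- function(i = 0) {
  source("someGreatCode.R")  # resolved against each slave's own working directory
  i
}

## Hypothetical wrapper, so that nothing is spawned until it is called.
run_myThing_on_slaves <- function(nslaves = 2) {
  library(Rmpi)
  mpi.spawn.Rslaves(nslaves = nslaves)
  on.exit(mpi.close.Rslaves())    # shut the slaves down when the wrapper exits
  mpi.bcast.Robj2slave(myThing)   # copy the function definition to every slave
  ## mpi.bcast.cmd(myThing()) would run the call without returning anything;
  ## mpi.remote.exec() runs it and collects one result per slave.
  mpi.remote.exec(myThing())
}
```

If the sourced file itself contains top-level calls that start the job, each
slave would run them as soon as it sources the file, which may be what
"execution commands" refers to.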
You mention a "race problem." Can I ask how I could tell if I have
that problem?
Here's why I ask. One user has a job that uses source() and it runs
very slowly. It may take 1000 times as long as it would if the separate
parts were sent without Rmpi. I can't understand why it works at all,
frankly. Unfortunately, his code is private and I can't share it with
you, but I'm trying to build a test case to reproduce the problem. If
you tell me how to diagnose the race question, then perhaps I can
figure it out.
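
One way a test case might separate "source() is slow on the nodes" from
"the computation is slow" is to time the source() call by itself on every
slave. This is only a sketch: timed_source is a hypothetical helper, and a
slow source() on all slaves would point toward contention on the shared
file system without proving a race condition.

```r
## Hypothetical helper: time a single source() call and report where it ran.
timed_source <- function(path) {
  ## Source into a throwaway environment so the timing run leaves no objects behind.
  elapsed <- system.time(source(path, local = new.env()))[["elapsed"]]
  data.frame(host    = Sys.info()[["nodename"]],
             pid     = Sys.getpid(),
             file    = path,
             seconds = elapsed)
}

## With slaves already spawned (assumes Rmpi), something like:
##   mpi.bcast.Robj2slave(timed_source)
##   mpi.remote.exec(timed_source("someGreatCode.R"), simplify = FALSE)
## would return one timing record per slave for comparison with the master.
```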
In the other cases I've tested, my mistakes cause crashes and there is
an error message from a node saying that a function is not defined, so
I know I have to send it to the nodes. If I can get a program to
crash, I can usually fix it. It is the ones that run forever, or
almost forever, that I can't fix.
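
For the jobs that never finish, one option is to have the function leave a
trail in a file as it goes, since output printed on a slave is not visible in
the master session. The sketch below assumes the directory passed as dir is
on the shared NFS home and is writable from every node; log_progress is a
hypothetical helper, not part of Rmpi.

```r
## Hypothetical helper: append a timestamped progress line to a per-process file.
log_progress <- function(msg, dir = getwd()) {
  ## One file per host and process, so slaves never write to the same file.
  logfile <- file.path(dir, sprintf("progress-%s-%d.log",
                                    Sys.info()[["nodename"]], Sys.getpid()))
  line <- sprintf("%s %s", format(Sys.time(), "%Y-%m-%d %H:%M:%S"), msg)
  cat(line, "\n", sep = "", file = logfile, append = TRUE)
  invisible(logfile)
}

## Possible use (assumes Rmpi and spawned slaves):
##   mpi.bcast.Robj2slave(log_progress)
## then call log_progress("pre1 done") and similar at each stage of the job.
```

Watching those files from the master (for example with tail -f) would show
which stage each slave last completed.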
pj
> I don't know if a race condition will be an issue, since the slaves
> are competing for the same file.
>
> mpi.scatter.Robj/mpi.gather.Robj can also be used to send/receive objects
> among master and slaves.
>
> Hao
>
>
> Paul Johnson wrote:
>> Hi, everybody.
>>
>> A user came in with a problem on our Rocks Linux Cluster. His function
>> runs fine in an interactive session, but when he sends the function to
>> compute nodes with Rmpi, they never return. I'd not seen that before.
>> We are sending out a few big tasks to a few nodes.
>>
>> So I took his code, which is hundreds of lines long, spread across 4
>> files, and I've been staring at it for hours. It makes me wonder ...
>>
>> Question 1. How do auxiliary functions find their way onto compute nodes?
>>
>> On the master, this sends "SimJob" to the compute nodes. SimJob is
>> inside "SimJob.R", as is "pars". But if SimJob calls other functions,
>> how does the compute node find them?
>>
>> ############################################
>> library(Rmpi)
>> mpi.spawn.Rslaves(nslaves=4)
>>
>> source("SimJob.R")
>> pars
>>
>> ExitStatus <- mpi.parApply(pars, MARGIN=1, fun=SimJob)
>> cat("\n",table(ExitStatus),"\n")
>>
>> mpi.close.Rslaves()
>> mpi.quit()
>> ############################################
>>
>> SimJob.R does lots of things: it creates the object "pars" and
>> many other functions and definitions.
>>
>> "SimJob.R" has some interlinked functions like this:
>>
>> pre1 <- function(i) { whatever; source("someFile.R") }
>>
>> pre2 <- function (j, something) { whatever(something);
>> source("someOtherFile.R") }
>>
>> pre3 <- function(i) { whatever }
>>
>> SimJob <- function(x,i, j){
>> result1 <- pre1(i)
>> result2 <- pre2(j, result1)
>> result3 <- someRFunction(result1, result2)
>> }
>>
>> someRFunction is in an R package, say "lm" or something like that.
>>
>> How does a compute node get functions "pre1" and "pre2" and the files
>> they source?
>>
>> What if the implementation of pre2 calls some function pre3?
>>
>> We ARE on an NFS system with home folder available on all compute
>> nodes. But the compute nodes don't inherit the working directory of
>> the master, do they?
>>
>> Here's the frustrating part. I can run interactively on the master
>>
>>> SimJob( pars[1, ] )
>>
>> But the whole job won't run on the compute nodes.
>>
>> 2. Suppose a function that we send to a node tries to write a result.
>> It has save(whatever, file="blha.Rda") in it. Where does that file
>> go? What is the "current working directory" on the compute node?
>>
>> I think that we have to re-write this so we return the information to
>> the master node and save it there.
>>
>>
>> 3. Is there a way I can find out what is going on "over there" on a
>> compute node while it is working?
>>
>> I wish I could put a bunch of print statements in so I could track the
>> thing's progress, but don't know how to monitor them.
>>
>> When this program runs interactively, it spits out some messages to
>> StdOut. On a compute node, where do those go?
>>
>> I've used the web program "ganglia" to see that nodes are actually
>> being used. They are, using lots of CPU.
>>
>>
>> I've re-worked this code so that it is all in one file (no more use
>> of source). Still the same thing.
>>
>> I can run SimJob() interactively, but it never runs on the
>> slaves.
>>
>> Well, so long, I would appreciate your ideas.
>>
>> --
>> Paul E. Johnson
>> Professor, Political Science
>> 1541 Lilac Lane, Room 504
>> University of Kansas
>>
>> _______________________________________________
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>
>
>
> --
> Department of Statistics & Actuarial Sciences
> The University of Western Ontario
> London, Ontario N6A 5B7
> Office Phone#:(519)-661-3622
> Fax Phone#:(519)-661-3813
> http://www.stats.uwo.ca/faculty/yu
>
--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas