[R-sig-hpc] Cluster R "environment" trouble. Using Rmpi

Sat Aug 14 09:57:39 CEST 2010

Hi, everybody.

A user came in with a problem on our Rocks Linux Cluster. His function
runs fine in an interactive session, but when he sends the function to
compute nodes with Rmpi, they never return.  I'd not seen that before.
We are sending out a few big tasks to a few nodes.

So I took his code, which is hundreds of lines long, spread across 4
files, and I've been staring at it for hours.  It makes me wonder ...

Question 1. How do auxiliary functions find their way onto compute nodes?

On the master, this sends "SimJob" to the compute nodes. SimJob is
inside "SimJob.R", as is "pars".  But if SimJob calls other functions,
how does the compute node find them?

############################################
library(Rmpi)
mpi.spawn.Rslaves(nslaves=4)

source("SimJob.R")
pars

ExitStatus <- mpi.parApply(pars, MARGIN=1, fun=SimJob)
cat("\n",table(ExitStatus),"\n")

mpi.close.Rslaves()
mpi.quit()
############################################

SimJob.R does lots of things: it creates the object "pars" and defines
many other functions and objects.

"SimJob.R" has some interlinked functions like this:

pre1 <- function(i)   {  whatever; source("someFile.R") }

pre2 <- function (j, something) {  whatever(something);
source("someOtherFile.R") }

pre3 <- function(i) { whatever }

SimJob <- function(x, i, j) {
    result1 <- pre1(i)
    result2 <- pre2(j, result1)
    result3 <- someRFunction(result1, result2)
}

someRFunction comes from an R package; think of something like "lm".

How does a compute node get the functions "pre1" and "pre2" and the
files they source?

What if the implementation of pre2 calls some function pre3?
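
To make the question concrete, here is a minimal sketch of how one
might list everything SimJob depends on, so that each piece could be
shipped to the compute nodes. It assumes the codetools package (a
recommended package distributed with R). The function bodies are toy
stand-ins, and the helper neededFunctions is hypothetical; it is not
part of Rmpi.

```r
library(codetools)

# Toy stand-ins for the functions in SimJob.R (bodies are hypothetical).
pre3 <- function(i) i + 1
pre1 <- function(i) i * 2
pre2 <- function(j, something) pre3(j) + something
SimJob <- function(x) {
    result1 <- pre1(x[1])
    result2 <- pre2(x[2], result1)
    result1 + result2
}

# Hypothetical helper: walk the call graph starting at `fname` and collect
# every function that is defined directly in `env` (user code, not packages).
neededFunctions <- function(fname, env, seen = character(0)) {
    if (fname %in% seen) return(seen)
    seen <- c(seen, fname)
    globals <- findGlobals(get(fname, envir = env), merge = FALSE)$functions
    for (g in globals) {
        if (exists(g, envir = env, inherits = FALSE) &&
            is.function(get(g, envir = env))) {
            seen <- neededFunctions(g, env, seen)
        }
    }
    seen
}

# Look in the environment where SimJob itself was defined.
needed <- neededFunctions("SimJob", env = environment(SimJob))
print(needed)
```

On the master, with the slaves already spawned, each name found this
way could then be sent with a call such as mpi.bcast.Robj2slave(pre1),
and a package could be loaded on the slaves with something like
mpi.bcast.cmd(library(somePackage)), where somePackage is a
placeholder. Whether missing helpers are really what stalls this
particular job is not established.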

We ARE on an NFS system with home folder available on all compute
nodes.  But the compute nodes don't inherit the working directory of
the master, do they?

Here's the frustrating part. I can run interactively on the master

> SimJob( pars[1, ] )

But the whole job won't run on the compute nodes.

2. Suppose a function that we send to a node tries to write a result.
It has save(whatever, file="blha.Rda") in it.  Where does that file
go?  What is the "current working directory" on the compute node?

I think that we have to re-write this so we return the information to
the master node and save it there.
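
A minimal sketch of one way to sidestep the working-directory
question: build an absolute path and have the function return it, so
the master learns where the file went. The names outDir and saveResult
are hypothetical, and tempdir() only stands in for a directory under
the NFS-mounted home.

```r
# Hypothetical output directory. On the cluster this would be an absolute
# path under the NFS-mounted home, so every node writes to the same place.
outDir <- file.path(tempdir(), "simResults")
dir.create(outDir, showWarnings = FALSE, recursive = TRUE)

# Hypothetical helper: save one result under an absolute path and return
# that path instead of relying on the node's current working directory.
saveResult <- function(result, jobId, dir = outDir) {
    outFile <- file.path(dir, sprintf("result-%03d.Rda", jobId))
    save(result, file = outFile)
    outFile
}

path <- saveResult(list(estimate = 1.5), jobId = 1)
print(path)
print(file.exists(path))
```

To see which directory the slaves are actually in, Rmpi's
mpi.remote.exec(getwd()) should report it from each of them.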

3. Is there a way I can find out what is going on "over there" on a
compute node while it is working?

I wish I could put a bunch of print statements in so I could track the
thing's progress, but I don't know how to monitor them.

When this program runs interactively, it spits out some messages to
StdOut.  On a compute node, where do those go?
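
One possibility, sketched under assumptions: instead of printing, each
process appends time-stamped lines to its own log file on the shared
file system, which can then be watched from the master with tail -f.
The names logDir, logFile, logMsg and tracedJob are hypothetical, and
tempdir() stands in for an NFS-mounted directory.

```r
# One log file per process, named by host and PID so that nodes do not
# overwrite each other's output. logDir is a stand-in for a shared directory.
logDir <- tempdir()
logFile <- file.path(logDir, sprintf("simjob-%s-%d.log",
                                     Sys.info()[["nodename"]], Sys.getpid()))

# Hypothetical helper: append one time-stamped line to the log file.
logMsg <- function(..., file = logFile) {
    cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), " ", ..., "\n",
        sep = "", file = file, append = TRUE)
}

# Stand-in for the real job, with progress messages at each stage.
tracedJob <- function(x) {
    logMsg("starting job with x = ", paste(x, collapse = ", "))
    result <- sum(x)
    logMsg("finished, result = ", result)
    result
}

res <- tracedJob(c(1, 2, 3))
print(readLines(logFile))
```

Separately, mpi.spawn.Rslaves() has a needlog argument that, as I
understand it, makes each slave write its own log file; where those
files end up on this cluster is something I have not checked.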

I've used the web program "ganglia" to see that nodes are actually
being used.  They are, using lots of CPU.

I've re-worked this code so that it is all in one file (no more use
of source).  Still the same thing.

I can run SimJob() interactively on the master, but it never finishes
on the slaves.

Well, so long, I would appreciate your ideas.

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas