[R-sig-hpc] errors in Rmpi programming. I get no error messages, just stalled programs. Would you test your cluster?

Paul Johnson pauljohn32 at gmail.com
Sat Feb 4 21:44:21 CET 2012


I've been working on R mpi programs for a couple of years, and, like
everybody else on the bleeding edge, I see a lot of weird crashes and
such.

One problem that really concerns me is user error in R program code
that causes cluster runs to hang, rather than crash.  R and Rmpi seem
not to give any error messages.  I do not know if this problem traces
back to OpenMPI, to the Torque cluster scheduler, or to R, but I am
really interested to know if you see it too in your parallel efforts.

Here it is in a nutshell.  The user forgets to send a function to the
compute nodes, but a function that uses that (not exported) function
gets called.

The user forgets to include this:

## objects to export to each node
clusterExport(cl, c("projSeeds", "useStream", "initSeedStreams"))

But in the function that is applied in the cluster, the
"initSeedStreams" function is used, like so:

runOneSimulation <- function(run, parm){
  initSeedStreams(run)
  ##then some gigantic, long lasting computation occurs
  dat <- data.frame(x1 = rnorm(parm$N), x2 = rnorm(parm$N), y = rnorm(parm$N))
  m1 <- lm(y ~ x1 + x2, data=dat)
  useStream(2)
  dat2 <- data.frame(x1 = rnorm(parm$N), x2 = rnorm(parm$N), y = rnorm(parm$N))
  m2 <- lm(y ~ x1 + x2, data=dat2)
  list(m1, summary(m1), model.matrix(m1), m2)
}
projParms <- list("N" = 999)
nReps <- 15  ##Note more reps than nodes

res <- snow::clusterApplyLB(cl, 1:nReps, runOneSimulation, projParms)

On our cluster, what happens is that the master node CPU usage climbs
to 100%, but no work gets done.  The Ganglia web view shows that the
CPU time is not user calculations (which are blue in the CPU bar
chart); instead it is 100% red, which is system CPU usage.  The
program will sit there doing nothing until the walltime requested in
the submission script is used up.
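
One way to catch the forgotten export before the job is submitted is
to compare the global names a worker function uses against the list
given to clusterExport.  The following is only a sketch, not part of
the example programs: the helper name missingExports is hypothetical,
it uses findGlobals from the codetools package, and it assumes the
compute nodes attach the same packages as the master.

```r
library(codetools)

## Hypothetical helper (not from the original post): list the global
## names used by `fun` that are neither in `exported` nor found in a
## package attached on the master.
missingExports <- function(fun, exported = character(0)) {
  used <- findGlobals(fun, merge = TRUE)
  ## Start the lookup above the global environment, so only attached
  ## packages and base count as "already available on the nodes".
  pkgEnv <- parent.env(globalenv())
  inPackage <- vapply(used, exists, logical(1),
                      envir = pkgEnv, inherits = TRUE)
  setdiff(used[!inPackage], exported)
}

## The worker function from the post; the two helpers it calls do not
## need to be defined for the static check to work.
runOneSimulation <- function(run, parm){
  initSeedStreams(run)
  dat <- data.frame(x1 = rnorm(parm$N), x2 = rnorm(parm$N), y = rnorm(parm$N))
  m1 <- lm(y ~ x1 + x2, data=dat)
  useStream(2)
  dat2 <- data.frame(x1 = rnorm(parm$N), x2 = rnorm(parm$N), y = rnorm(parm$N))
  m2 <- lm(y ~ x1 + x2, data=dat2)
  list(m1, summary(m1), model.matrix(m1), m2)
}

## With the incomplete export list, "useStream" is reported.
print(missingExports(runOneSimulation, c("projSeeds", "initSeedStreams")))
```

The check is shallow: it looks only at the body of the function it is
given, so an object such as projSeeds that is used inside
initSeedStreams would not be noticed unless that function is checked
as well.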

This is bad/frustrating for me, but I'm used to making mistakes and
fixing them.  I am old enough to remember programming for IBM
mainframes, when a coding mistake resulted in a full page of ASCII art
saying "ABEND" on a gigantic sprocket-printed piece of paper.  But for
the students who are learning to program and use Linux for the first
time, it is like a death sentence.  Too discouraging.

I've got example programs you can try to see if they cause the same
problem on your cluster.

This one is my proposal for the "multiple random streams" per repeated
simulation:

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/

If you have, say, 10 nodes, and you repeat a project 10000 times, that
one will use 10000 separate sets of seed streams, so any particular
repetition can be re-started from the same spot.
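
The idea of one set of seed streams per repetition can be sketched
with the L'Ecuyer-CMRG generator and nextRNGStream from the parallel
package.  This is an illustration under assumptions, not the code in
the Ex66 directory: the helper names makeRunSeeds and setStream, the
master seed, and the layout of the seed list are all hypothetical.

```r
library(parallel)

## Hypothetical helper: build a list with one entry per run; each
## entry holds `streamsPerRun` L'Ecuyer-CMRG seed vectors.
makeRunSeeds <- function(nRuns, streamsPerRun, masterSeed = 12345) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(masterSeed)
  seed <- .Random.seed
  seeds <- vector("list", nRuns)
  for (run in seq_len(nRuns)) {
    seeds[[run]] <- vector("list", streamsPerRun)
    for (s in seq_len(streamsPerRun)) {
      seeds[[run]][[s]] <- seed
      seed <- nextRNGStream(seed)  # advance to the next stream
    }
  }
  seeds
}

## Hypothetical helper: point the generator at stream `s` of run `run`
## by replacing the global .Random.seed.
setStream <- function(seeds, run, s) {
  assign(".Random.seed", seeds[[run]][[s]], envir = globalenv())
  invisible(NULL)
}

## Restart run 3 from the same spot twice and compare the draws.
demoSeeds <- makeRunSeeds(nRuns = 5, streamsPerRun = 2)
setStream(demoSeeds, run = 3, s = 1)
first <- rnorm(3)
setStream(demoSeeds, run = 3, s = 1)
again <- rnorm(3)
print(identical(first, again))
```

Because the seeds are indexed by run number rather than by node, a
given repetition draws the same numbers no matter which node happens
to pick it up.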

If your scheduler is different, you may have to modify the submission
script, but I *believe sincerely* that this program
"controlledSeeds.R" will run as provided, as long as you also have the
seedCreator.R in the same folder when you run it for the first time.
Either that, or you run seedCreator.R one time before submitting this
to the cluster. It writes the project seeds file "projSeeds.rda".

Now, break the program.  Once that runs happily for you, comment out
the line that exports the functions:

clusterExport(cl, c("projSeeds", "useStream", "initSeedStreams"))

OR change it to something like

clusterExport(cl, c("projSeeds", "initSeedStreams"))

Then please let me know what happens on your cluster.  Does the job
just hang like a dead dog until its walltime runs out?

I hope to hear the results of your test.

pj

If you are just interested in testing out the various R parallel
things I've explored, I've got:

Rmpi:

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex53-HelloWorldRmpi/


Snow:

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex60-HelloWorldSnow/

SnowFT:

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex61-HelloWorldSnowFT/

R parallel package (R 2.14.1 or newer):

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex65-R-parallel/

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas


