[R-sig-hpc] openmpi/rmpi/snow: current puzzles, possible improvements [diagnosis]

Ross Boylan ross at biostat.ucsf.edu
Sat May 16 01:07:52 CEST 2009


I think there were several things wrong.
1) I wasn't exporting R_PROFILE to the remote nodes.
2) R CMD BATCH's output file was the same file for all processes, given
NFS.
3) The remote nodes did not have Rmpi installed!

3) is obviously crucial; I'm not sure how significant the other problems
are.  I diagnosed it by changing the output file to /tmp/foo and running
only one job on each node.
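
In the future, something like this one-liner might serve as a quick check
of 3) (a sketch only; it just asks each launched process for its node name
and whether Rmpi shows up in installed.packages()):

mpirun -np 2 -host n5,n7 /usr/bin/R --slave -e \
  'cat(Sys.info()["nodename"], "Rmpi" %in% rownames(installed.packages()), "\n")'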

Is there a good way to get unique file names per process on the command
line?  The only way I can think of is to determine the output file
inside a batch script invoked by mpirun, using an env variable if one
is available (i.e., under OpenMPI 1.3, or 1.2 in some scenarios).
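
One possibility, untested (the wrapper name rbatch.sh is made up, and
OMPI_COMM_WORLD_RANK is the variable I believe OpenMPI 1.3 sets for each
launched process):

#!/bin/sh
# rbatch.sh: wrapper run by mpirun in place of R itself.  It picks an
# output file that is unique per MPI rank so the processes stop
# clobbering each other's .Rout over NFS.
RANK=${OMPI_COMM_WORLD_RANK:-0}
exec /usr/bin/R CMD BATCH "$1" "${1%.R}.${RANK}.Rout"

invoked as, e.g.,
mpirun -np 2 -host n5,n7 -x R_PROFILE ./rbatch.sh silly.R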

My new invocation looks like this:
R_PROFILE=/usr/lib/R/site-library/snow/RMPISNOWprofile; export R_PROFILE
mpirun -np 2 -host n5,n7 -x R_PROFILE /usr/bin/R CMD BATCH silly.R

With no output file named, I believe R CMD BATCH will write to silly.Rout
rather than to stdout, so the processes would again be sharing one file
over NFS (another argument for a wrapper like the one sketched above).
Since I can't actually run yet because of 3), this is speculative.

Ross

On Wed, 2009-05-13 at 21:52 -0700, Ross Boylan wrote:
> After reading through the thread around
> https://stat.ethz.ch/pipermail/r-sig-hpc/2009-February/000105.html, as
> well as looking at some other things, for ideas about running snow on
> top of Rmpi on Debian Lenny, I decided to try a shell script:
> ----------------------------------------------------------------
> R_PROFILE=/usr/lib/R/site-library/snow/RMPISNOWprofile; export R_PROFILE
> mpirun -np 6 -hostfile hosts R CMD BATCH snowjob.R snowjob.out
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> with this kind of snowjob.R:
> -------------------------------------------------------------------
> # This will only execute on the head node
> cl <- getMPIcluster()
> print(mpi.comm.rank(0))
> 
> quickinfo <- function() {
>   list(rank=mpi.comm.rank(0), machine=Sys.info()) #system("hostname"))
> }
> print(clusterCall(cl, quickinfo))
> stopCluster(cl)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> and hosts file
> -------------------
> n7 slots=3
> n5 slots=0  # changing this to 2 didn't help
> n4 slots=4
> ^^^^^^^^^^^^^^^^^^^
> 
> I'm on n7.
> 
> Two problems.
> 
> First, the job shown never terminates. snowjob.out shows the standard R
> banner, a standard harmless complaint, and then nothing (technically it
> shows 
> [n7:14829] OOB: Connection to HNP lost
> but I assume that is after I ^c my shell script).
> 
> I suspect the problem is that it's having trouble reaching the other
> nodes.
> 
> Second, if I have n7 slots=7 the job completes.  It shows everything on
> n7.  However, if I use machine=system("hostname") I get back null
> strings.  system("hostname") works fine interactively.
> 
> Perhaps this is some kind of quoting effect when system("hostname") is
> exported via clusterCall?  Or system() doesn't work under Rmpi?
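> 
> One guess, untested: system("hostname") returns the command's exit
> status rather than its output unless intern=TRUE is supplied, which
> could explain not getting the host name back.  A variant of quickinfo
> to check that:
> 
> quickinfo <- function() {
>   # intern=TRUE makes system() return the command's stdout as a character vector
>   list(rank = mpi.comm.rank(0),
>        machine = system("hostname", intern = TRUE))
> }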
> 
> I'm also not sure why I am not running into a 3rd problem: it looks as
> if each process should be writing to the same file snowjob.out (via NFS
> mounts).  That doesn't seem to be happening.  Perhaps because the slave
> R's never make it out of the RMPISNOWprofile code?
> 
> If anyone has any thoughts or suggestions, I'd love to hear them.
> 
> Ross
> 
> P.S. The original problem is that, apparently, makeCluster(n,
> type="MPI") will not spawn jobs on other nodes; it may not even spawn
> more than one job at all.  So I'm attempting to bring up snow within an
> MPI session.
> 
> I did notice the docs on MPI_COMM_SPAWN
> http://www.mpi-forum.org/docs/mpi21-report-bw/node202.htm#Node202
> indicate there is an info argument which could contain system-dependent
> information.  Presumably this could include a hostname; the standard
> explicitly leaves this to the implementation.  I couldn't find anything
> on the openmpi implementation.  I suppose the source would at least
> indicate what works now.
> 
> So, IF openmpi supports it, and if the interface is exposed through Rmpi
> (which does have mpi.info functions that might be able to construct the
> right arguments), there would be a possibility of handling this strictly
> within R.
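> 
> A speculative sketch of the Rmpi side, assuming OpenMPI honors a "host"
> info key for spawn (the key name is my guess) and that the info object
> can actually be routed to the spawn call:
> 
> library(Rmpi)
> mpi.info.create(0)                 # set up MPI info object number 0
> mpi.info.set(0, "host", "n5,n7")   # guess at a key OpenMPI might accept
> # mpi.comm.spawn() takes an info= argument; whether the higher-level
> # mpi.spawn.Rslaves()/snow path ever sees this object is exactly the
> # part I have not been able to confirm.
> 
> Untested, so treat it only as a direction to explore.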
> 

-- 
Ross Boylan                                      wk:  (415) 514-8146
185 Berry St #5700                               ross at biostat.ucsf.edu
Dept of Epidemiology and Biostatistics           fax: (415) 514-8150
University of California, San Francisco
San Francisco, CA 94107-1739                     hm:  (415) 550-1062


