[R-sig-hpc] problems with parallel launch under Rmpi/OpenMPI

Ross Boylan ross at biostat.ucsf.edu
Wed Jan 29 04:37:56 CET 2014


I was able to launch parallel R only as long as all the nodes were on the
same machine.  This is a report of the problem and the fix!  (I was
going to ask why it wasn't working, but now I'm just sharing the
information.)

GOAL: Have an interactive master session controlling slaves, so that I
can debug.  I want the slaves running Rmpi's standard command loop,
hence the use of Rmpi's Rprofile, in modified form, below.
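(As an illustration of the intended workflow, not a transcript: once the
slaves sit in Rmpi's command loop, the master drives them interactively
with Rmpi calls, along these lines.)

```r
## Illustrative only: with the slaves in Rmpi's command loop, the master
## can broadcast work to them from its interactive prompt.
library(Rmpi)
mpi.remote.exec(Sys.info()[["nodename"]])  # ask each slave which host it is on
mpi.bcast.cmd(x <- 1:10)                   # run a command on every slave
```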

WORKS IF ALL ON ONE HOST

The following works:
-----------------------------
ross at n10:~/KHC/sunbelt$ ./rmpilaunch
master (rank 0, comm 1) of size 7 is running on: n10
slave1 (rank 1, comm 1) of size 7 is running on: n10
slave2 (rank 2, comm 1) of size 7 is running on: n10
slave3 (rank 3, comm 1) of size 7 is running on: n10
# ... etc
---------------------------------

rmpilaunch is
---------------------------
#! /bin/sh
R_PROFILE_USER=~/KHC/sunbelt/Rmpiprofile orterun -np 7 R --no-save -q 
---------------------------

Rmpiprofile is a slightly modified version of the file distributed with
Rmpi.  It adds the following lines at the start:
----------------------------------
# This file was copied from ~/Rlib/Rmpi/Rprofile
# I do not take the advice to name it .Rprofile since I need to keep the
# existing file.  Set R_PROFILE_USER to this file's name to use it.
# First we invoke the existing startup file.  This is essential to get
# the paths for libraries, including Rmpi.
setwd("~/KHC/sunbelt")
source("~/.Rprofile")

## Standard code below here.
--------------------------------

~/.Rprofile sets some local options, in particular putting my personal R
library directory at the front of .libPaths().  Without this, R will not
find a loadable Rmpi.
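The relevant part of ~/.Rprofile amounts to something like the following
(a sketch with an illustrative path, not my exact file):

```r
## Sketch of the relevant ~/.Rprofile lines (path is illustrative):
## put the personal library first so library(Rmpi) resolves there
.libPaths(c("~/Rlib", .libPaths()))
```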

FAILS WITH MULTIPLE HOSTS

However, if I add -hostfile <full path to hostfile> to the orterun line,
I never get any messages about master or slaves starting, and I don't
get the command line back.

----------------------------------------
ross at n10:~/KHC/sunbelt$ cat rmpilaunch
#! /bin/sh
R_PROFILE_USER=~/KHC/sunbelt/Rmpiprofile orterun -hostfile ~/KHC/sunbelt/hosts -np 7 R --no-save -q
ross at n10:~/KHC/sunbelt$ # note use of -hostfile.  expected to fail.
ross at n10:~/KHC/sunbelt$ ./rmpilaunch
>
>
>
>
-------------------------------------------------


SOLUTION

Tell mpi to export R_PROFILE_USER:
----------------------------------------------------------------------
ross at n10:~/KHC/sunbelt$ cat rmpilaunch
#! /bin/sh
R_PROFILE_USER=~/KHC/sunbelt/Rmpiprofile orterun -x R_PROFILE_USER -hostfile ~/KHC/sunbelt/hosts -np 7 R --no-save -q
ross at n10:~/KHC/sunbelt$ # exported the env variable
ross at n10:~/KHC/sunbelt$ ./rmpilaunch
master (rank 0, comm 1) of size 7 is running on: n10
slave1 (rank 1, comm 1) of size 7 is running on: n10
slave2 (rank 2, comm 1) of size 7 is running on: n10
slave3 (rank 3, comm 1) of size 7 is running on: n11
slave4 (rank 4, comm 1) of size 7 is running on: n11
slave5 (rank 5, comm 1) of size 7 is running on: n11
slave6 (rank 6, comm 1) of size 7 is running on: n11
----------------------------------------------------------------------
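(Side note: if I read mpirun(1) correctly, Open MPI's -x also accepts a
NAME=value form, which sets and exports the variable in one flag.  I have
not tested that variant here:)

```shell
# Untested variant: -x NAME=value sets and exports in a single flag
orterun -x R_PROFILE_USER=~/KHC/sunbelt/Rmpiprofile \
    -hostfile ~/KHC/sunbelt/hosts -np 7 R --no-save -q
```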

ANALYSIS

Apparently orterun does not export R_PROFILE_USER to the remote nodes
(maybe it would have if I had export'ed it in the shell?).  So the remote
slaves never saw the custom startup file and therefore never even
attempted to load Rmpi.

This explanation seems plausible, but it's possible it's wrong :)
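A small local illustration of the first half of that explanation:
`VAR=value cmd` places VAR in cmd's own environment, so every rank forked
on the local host inherits it; a rank on another node is instead forked by
that node's orted daemon, which never saw the variable.

```shell
# VAR=value cmd puts VAR into cmd's environment, so cmd and any local
# children inherit it -- which is why the all-on-one-host run worked.
R_PROFILE_USER=/tmp/Rmpiprofile sh -c 'echo "child sees: $R_PROFILE_USER"'
# prints: child sees: /tmp/Rmpiprofile
# A remote rank is started by the remote node's orted, whose environment
# does not contain R_PROFILE_USER unless orterun forwards it with -x.
```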


ENVIRONMENT

Environment: Debian squeeze, openmpi 1.4.2-4, R 3.0.1-3~squeezecran3.0,
Rmpi 0.6.3 installed from CRAN source (not the Debian package) into my
local library directory.  There is no batch queueing system.  I am
invoking rmpilaunch from a shell running under emacs, and using ESS's
ess-remote to get into R mode when the launch is successful.

This is quite similar to this thread
https://stat.ethz.ch/pipermail/r-sig-hpc/2009-February/000104.html.

Ross Boylan
