[R-sig-hpc] Help with doMPI on multiple cores on a cluster
    Lockwood, Glenn 
    glock at sdsc.edu
       
    Mon Oct 21 17:56:59 CEST 2013
    
    
  
I'd like to echo Steve's advice: try using Open MPI instead.  I've had innumerable problems trying to get Rmpi to work with mvapich2 (on which Intel MPI is based), and officially, Rmpi only supports Open MPI and MPICH.  Getting it to work with anything else is an uphill battle.
Also, you should be running mpirun with -n 1 (as you are already doing) if you are calling R directly.  Doing anything else will cause multiple copies of the master script to run, each spawning its own set of MPI ranks and leaving you with far more MPI ranks than you want.  Some libraries provide special wrappers that let you launch them directly with mpirun -np 32 (e.g., snow provides the RMPISNOW command), but these are unique to each library, whereas using mpirun -n 1 is universal.
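To put rough numbers on that, here is a back-of-the-envelope sketch in bash.  The function name total_ranks is made up for illustration, and it assumes every copy of the master script requests its own full set of workers, as startMPIcluster(count=31) does in the script quoted below:

```shell
#!/bin/bash
# total_ranks is a hypothetical helper, not part of any MPI or R tooling.
# It estimates how many MPI ranks are requested when mpirun starts
# MASTERS copies of an R script and each copy spawns WORKERS workers.
total_ranks() {
    local masters=$1 workers=$2
    echo $(( masters * (1 + workers) ))
}

# mpirun -n 1 with startMPIcluster(count=31): one master plus 31 workers
echo "mpirun -n 1:  $(total_ranks 1 31) ranks"

# mpirun -n 32 with the same script: 32 masters, each asking for 31 workers
echo "mpirun -n 32: $(total_ranks 32 31) ranks"
```

On a 32-core allocation the first case fits exactly, while the second asks for far more ranks than there are cores.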
Glenn
On Oct 21, 2013, at 6:55 AM, Stephen Weston <stephen.b.weston at gmail.com> wrote:
> Hi Srihari,
> 
> I suspect it's an MPI issue.  Are you able to run any other simple MPI
> programs successfully, and specifically, any using R with Rmpi?  From
> the error message, it appears that you're using Intel MPI, which I've
> never used.  I believe Rmpi is primarily tested with Open MPI, which
> is what I've always used with doMPI.  It would be interesting to see
> if you can run successfully using Open MPI, if that is possible for
> you.
> 
> You'll probably need to look for help on an Intel MPI forum, although
> you may need to reduce the problem to something that doesn't use R.
> 
> Here is a similar issue that I found on an Intel MPI forum:
> 
>    http://software.intel.com/en-us/forums/topic/329053
> 
> You could also try running without spawning, since that may be a
> problem for Intel MPI.  To do that, change the R script to use:
> 
>    cl <- startMPIcluster()
> 
> Also change the mpirun command in the PBS script to use '-n 32' or
> don't specify the -n option at all.  In that case, mpirun will start
> all of the workers as well as the master which may work better.
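
A sketch of what the launch line in the PBS script might look like with that change.  It assumes a Torque-style $PBS_NODEFILE with one line per allocated core, and that ParallelAnalysis.R now calls startMPIcluster() with no count argument.  NP and launch_cmd are names made up here, and the fallback of 32 is only there so the snippet can be tried outside a job:

```shell
#!/bin/bash
# Non-spawn launch: mpirun starts the master and all workers itself.
# NP and launch_cmd are illustrative names, not from the original script.

# Assumption: $PBS_NODEFILE has one line per allocated core (2 nodes x 16
# cores = 32 lines here). Fall back to 32 when not running under PBS.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "${PBS_NODEFILE}" ]; then
    NP=$(( $(wc -l < "${PBS_NODEFILE}") ))
else
    NP=32
fi

launch_cmd="mpirun -n ${NP} R --slave -f ParallelAnalysis.R"
echo "would run: ${launch_cmd}"

# Only launch when both mpirun and the R script are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -f ParallelAnalysis.R ]; then
    time ${launch_cmd}
fi
```

Deriving the rank count from the node file rather than hard-coding 32 keeps the launch line in step with the #PBS -lnodes request if that changes.
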
> 
> Regards,
> 
> Steve Weston
> 
> On Mon, Oct 21, 2013 at 9:42 AM, Srihari Radhakrishnan
> <srihari at iastate.edu> wrote:
>> Hi,
>> 
>> I've been trying to use the doMPI to run the iterations of a for loop in
>> parallel (using the foreach package) on a cluster. However, I've been
>> running into issues - I think it's the way I am running the R script, but I
>> could be wrong. Here's the description of the problem.
>> 
>> We use a PBS scheduler to submit jobs; my script uses 2 nodes (32 cores)
>> for now. I run 1 version of the R interpreter which internally calls 31
>> workers using R's mpi libraries. I produce below the PBS script, the R code
>> (the relevant bits) and the error.
>> 
>> ***Begin PBS Script***
>> #!/bin/bash
>> 
>> #PBS  -o BATCH_OUTPUT
>> #PBS  -e BATCH_ERRORS
>> 
>> #PBS -lnodes=2:ppn=16:compute,walltime=12:00:00
>> 
>> # Change to directory from which qsub command was issued
>>   cd $PBS_O_WORKDIR
>> 
>> cat $PBS_NODEFILE
>> #Call mpirun with 1 copy of the R interpreter. This will spawn 31 workers
>> #inside the R script
>> time mpirun -n 1 R --slave -f ParallelAnalysis.R
>> ***End PBS script***
>> 
>> ***Begin R Script***
>> source("http://bioconductor.org/biocLite.R")
>> #MPI stuff initialization
>> library(Rmpi)
>> library(foreach)
>> library(doMPI)
>> cl <- startMPIcluster(count=31) #call 31 clusterworkers/slaves
>> registerDoMPI(cl)
>> library(MEDIPS)
>> library(BSgenome)
>> .
>> .
>> *more R code; variable assignments etc; no mpi stuff here*
>> .
>> .
>> #Following code will run 100 parallel iterations using the doMPI library
>> #loaded above and output results to the variable x. x is a table and stores
>> #results from iterations as rows.
>> x <- foreach(i=1:100, .combine='rbind') %dopar% {
>> *stuff to do inside loop*
>> }
>> write.table(x, "output.tsv") #write x into file.
>> ***End R script***
>> 
>> The execution halts as soon as the libraries are loaded - I get the
>> following error message repeatedly from both nodes (node 203 and node 202)
>> 
>> [2:node203] unexpected disconnect completion event from dynamic process with rank=0 pg_id=kvs_17890_0 0x1fce600
>> Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
>> 
>> I am not sure if this is an issue with the compilers or the script itself.
>> The script runs successfully without using mpi (using only 1 node). Any
>> help would be highly appreciated.
>> 
>> Thanks in advance,
>> Srihari
>> 
>> --
>> Srihari Radhakrishnan
>> 
>> Ph.D candidate
>> Valenzuela Lab
>> Iowa State University
>> 
>> _______________________________________________
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
> 