[R-sig-hpc] R-sig-hpc Digest, Vol 61, Issue 4

Lockwood, Glenn glock at sdsc.edu
Tue Oct 22 20:46:35 CEST 2013


Srihari,

mpd is part of MPICH (and MVAPICH/Intel MPI), not Open MPI, so your application probably isn't running the way you anticipated at all.  Most likely your login environment ($PATH, $LD_LIBRARY_PATH, etc.) is not set up to use your new Open MPI installation.
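
A quick sanity check you can run on a node to see what you are actually picking up -- the paths and exact output below are only examples, so adjust them for your installation:

# which MPI launcher is first on your PATH, and which stack does it belong to?
which mpirun
mpirun --version     # for Open MPI this prints "mpirun (Open MPI) <version>"

# which MPI library did Rmpi actually link against?
ldd $(Rscript -e 'cat(system.file("libs", package="Rmpi"))')/Rmpi.so | grep -i mpi

If the mpirun you find is not the Open MPI one, or Rmpi.so resolves to a different MPI's libraries, that mismatch alone could well produce the kind of errors you are seeing.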

The errors you sent in the previous post sound like system-specific issues that require information specific to your cluster--it may be more effective to work with your local cluster admin (if you have one!) to iron out the issues arising from your MPI stack.  I don't think the problem is with R; just take a step back and make sure you have the essentials in place:

1. Your MPI stack needs to be compiled with one specific compiler (it sounds like you have both Intel and GCC available)
2. Your R needs to be compiled with that same compiler
3. Your Rmpi needs to be compiled against the MPI stack in #1, with the same compiler as #1 and #2 (this may be non-trivial)
4. All of these need to be accessible from your login environment, either via a "module" command (common on most clusters) or by correctly setting $PATH, $LD_LIBRARY_PATH, etc. in your .bashrc or .cshrc -- see the sketch below this list
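
For point 4, a minimal sketch of what that might look like in your ~/.bashrc -- the module names here are made up, and the explicit paths are just the ones from your earlier post, so substitute whatever your cluster actually provides:

# if your cluster uses environment modules:
module load gcc/4.4.6 openmpi/1.6.4

# otherwise, point your shell at the same Open MPI you built R and Rmpi against:
export PATH=/shared/openmpi-1.6.4/gcc-4.4.6/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-1.6.4/gcc-4.4.6/lib:$LD_LIBRARY_PATH

Note that Open MPI's mpirun can also forward environment variables to the remote ranks with "-x", e.g. "mpirun -x LD_LIBRARY_PATH -n 1 R --slave -f ParallelAnalysis.R", which is usually more reliable than sourcing ~/.bashrc at the top of the PBS script.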

If you are missing (or unsure of) any of these points, your local cluster admin is probably best qualified to sort them out for you.  Good luck!

Glenn


On Oct 22, 2013, at 10:05 AM, Srihari Radhakrishnan <srihari at iastate.edu> wrote:

> Update:
> 
> I just installed a local version of Open MPI (1.6.5) and reran the job.
> Here's the output from the log -
> 
>> source("http://bioconductor.org/biocLite.R")
>> #MPI stuff initialization
>> library(Rmpi)
>> library(foreach)
>> library(doMPI)
>> cl <- startMPIcluster(count=31)
> mpiexec_node228: cannot connect to local mpd (/tmp/mpd2.console_srihari);
> possible causes:
>  1. no mpd is running on this host
>  2. an mpd is running but was started without a "console" (-n option)
> 
> I did not realize I had to run mpdboot before mpirun - so I
> wrapped the mpirun command like this:
> 
> mpdboot -f ${PBS_NODEFILE}  -n 2
> time mpirun -n 1 R --slave -f ParallelAnalysis.R
> mpdallexit
> 
> When I do that, here's what I run into:
> 
> [unset]: Command cmd=put kvsname=singinit_kvs_3676_0 key=P0-businesscard-1
> value=rdma_port0#3676$rdma_host0#2:0:0:172:16:0:228:0:0:0:0:0:0:0:0$fabrics_list#shm_and_dapl$
> failed, reason='"'singinit_kvs_3676_0'"'
> Error in mpi.comm.spawn(slave = rscript, slavearg = args, nslaves = count, :
>  Other MPI error, error stack:
> MPI_Comm_spawn(170).............: MPI_Comm_spawn(cmd="/work/srihari/programs/R-3.0.1/lib64/R/bin/Rscript",
> argv=0x2982680, maxprocs=31, MPI_INFO_NULL, root=0, MPI_COMM_SELF,
> intercomm=0x25c03f0, errors=0x2e25bd0) failed
> MPIDI_Comm_spawn_multiple(123)..:
> MPIDI_CH3_Dynamic_processes(145):
> MPID_nem_reinit(1363)...........:
> MPIDI_PG_SetConnInfo_async(780).: PMI_KVS_Put returned -1
> Calls: startMPIcluster -> mpi.comm.spawn -> .Call
> Execution halted
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> 
> Pretty lost at the moment - any pointers would greatly help me!
> 
> Thanks
> Srihari
> 
> On Tue, Oct 22, 2013 at 10:13 AM, Srihari Radhakrishnan <srihari at iastate.edu
>> wrote:
> 
>> Thanks, Glenn and Steve for your input!
>> 
>> As you suggested, I switched to the Open MPI implementation of mpirun -
>> but I get another error this time:
>> 
>> [node205:02122] [[63174,1],3] ORTE_ERROR_LOG: A message is attempting to be sent to a
>> process whose contact information is unknown in file rml_oob_send.c at line 104
>> [node205:02122] [[63174,1],3] could not get route to [[INVALID],INVALID]
>> 
>> A little bit of googling told me to set my LD_LIBRARY_PATH variable to
>> include the openmpi libs - which I added to my ~/.bashrc as follows:
>> 
>> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/shared/openmpi-1.6.4/gcc-4.4.6/lib/
>> export LD_LIBRARY_PATH
>> 
>> I also sourced my ~/.bashrc file in the first line of my PBS script so that
>> LD_LIBRARY_PATH is reflected in all of the spawned workers as well. Still no
>> luck - I get the same error listed above.
>> 
>> Steve, I also tried running mpirun without the "-n 1" parameter - but, just
>> like Glenn said would happen, it spawned off multiple master processes.
>> Running startMPIcluster() without the count=31 argument didn't change the
>> respective errors with either MPI stack (Intel MPI/Open MPI).
>> 
>> Now, I am not sure whether the cluster admins compiled Open MPI with the
>> Intel compilers or with GCC. If they used the Intel compilers, could that be
>> part of the problem?
>> 
>> I sincerely appreciate all the help!
>> 
>> 
>> On Tue, Oct 22, 2013 at 5:00 AM, <r-sig-hpc-request at r-project.org> wrote:
>> 
>>> Send R-sig-hpc mailing list submissions to
>>>        r-sig-hpc at r-project.org
>>> 
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>        https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>> or, via email, send a message with subject or body 'help' to
>>>        r-sig-hpc-request at r-project.org
>>> 
>>> You can reach the person managing the list at
>>>        r-sig-hpc-owner at r-project.org
>>> 
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of R-sig-hpc digest..."
>>> 
>>> 
>>> Today's Topics:
>>> 
>>>   1. Help with doMPI on multiple cores on a cluster
>>>      (Srihari Radhakrishnan)
>>>   2. Re: Help with doMPI on multiple cores on a cluster
>>>      (Stephen Weston)
>>>   3. Re: Help with doMPI on multiple cores on a cluster
>>>      (Lockwood, Glenn)
>>> 
>>> 
>>> ----------------------------------------------------------------------
>>> 
>>> Message: 1
>>> Date: Mon, 21 Oct 2013 08:42:35 -0500
>>> From: Srihari Radhakrishnan <srihari at iastate.edu>
>>> To: r-sig-hpc at r-project.org
>>> Subject: [R-sig-hpc] Help with doMPI on multiple cores on a cluster
>>> Message-ID:
>>>        <CACq2iiM0NRo=
>>> 12JGzquezyrHd7WyDcL9K_O+uSg50itpJ8f9OQ at mail.gmail.com>
>>> Content-Type: text/plain
>>> 
>>> Hi,
>>> 
>>> I've been trying to use the doMPI package to run the iterations of a for
>>> loop in parallel (using the foreach package) on a cluster. However, I've been
>>> running into issues - I think it's the way I am running the R script, but I
>>> could be wrong. Here's a description of the problem.
>>> 
>>> We use a PBS scheduler to submit jobs; my script uses 2 nodes (32 cores)
>>> for now. I run one copy of the R interpreter, which internally spawns 31
>>> workers using R's MPI libraries. Below I give the PBS script, the relevant
>>> bits of the R code, and the error.
>>> 
>>> ***Begin PBS Script***
>>> #!/bin/bash
>>> 
>>> #PBS  -o BATCH_OUTPUT
>>> #PBS  -e BATCH_ERRORS
>>> 
>>> #PBS -lnodes=2:ppn=16:compute,walltime=12:00:00
>>> 
>>> # Change to directory from which qsub command was issued
>>>   cd $PBS_O_WORKDIR
>>> 
>>> cat $PBS_NODEFILE
>>> #Call mpirun with 1 copy of the R interpreter. This will spawn 31 workers,
>>> inside the R script
>>> time mpirun -n 1 R --slave -f ParallelAnalysis.R
>>> ***End PBS script***
>>> 
>>> ***Begin R Script***
>>> source("http://bioconductor.org/biocLite.R")
>>> #MPI stuff initialization
>>> library(Rmpi)
>>> library(foreach)
>>> library(doMPI)
>>> cl <- startMPIcluster(count=31) #call 31 clusterworkers/slaves
>>> registerDoMPI(cl)
>>> library(MEDIPS)
>>> library(BSgenome)
>>> .
>>> .
>>> *more R code; variable assignments etc; no mpi stuff here*
>>> .
>>> .
>>> #Following code will run 100 parallel iterations using the doMPI library
>>> loaded above and output results to the variable x. x is a table and stores
>>> results from iterations as rows.
>>> x <-foreach(i=1:100,.combine='rbind') %dopar% {
>>> *stuff to do inside loop*
>>> }
>>> write.table(x, "output.tsv") #write x into file.
>>> ***End R script***
>>> 
>>> The execution halts as soon as the libraries are loaded - I get the
>>> following error message repeatedly from both nodes (node 203 and node 202)
>>> 
>>> [2:node203] unexpected disconnect completion event from dynamic process
>>> with rank=0 pg_id=kvs_17890_0 0x1fce600
>>> Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
>>> 
>>> I am not sure if this is an issue with the compilers or the script itself.
>>> The script runs successfully without using mpi (using only 1 node). Any
>>> help would be highly appreciated.
>>> 
>>> Thanks in advance,
>>> Srihari
>>> 
>>> --
>>> Srihari Radhakrishnan
>>> 
>>> Ph.D candidate
>>> Valenzuela Lab
>>> Iowa State University
>>> 
>>> 
>>> 
>>> 
>>> ------------------------------
>>> 
>>> Message: 2
>>> Date: Mon, 21 Oct 2013 09:55:55 -0400
>>> From: Stephen Weston <stephen.b.weston at gmail.com>
>>> To: Srihari Radhakrishnan <srihari at iastate.edu>
>>> Cc: "R-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
>>> Subject: Re: [R-sig-hpc] Help with doMPI on multiple cores on a
>>>        cluster
>>> Message-ID:
>>>        <CALh21iK7Zap9UYstvxYm7sjBLX0=
>>> P3KzZUE9OLkDzfdkQjg89w at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>> 
>>> Hi Srihari,
>>> 
>>> I suspect it's an MPI issue.  Are you able to run any other simple MPI
>>> programs successfully, and specifically, any using R with Rmpi?  From
>>> the error message, it appears that you're using Intel MPI, which I've
>>> never used.  I believe Rmpi is primarily tested with Open MPI, which
>>> is what I've always used with doMPI.  It would be interesting to see
>>> if you can run successfully using Open MPI, if that is possible for
>>> you.
>>> 
>>> You'll probably need to look for help on an Intel MPI forum, although
>>> you may need to reduce the problem to something that doesn't use R.
>>> 
>>> Here is a similar issue that I found on an Intel MPI forum:
>>> 
>>>    http://software.intel.com/en-us/forums/topic/329053
>>> 
>>> You could also try running without spawning, since that may be a
>>> problem for Intel MPI.  To do that, change the R script to use:
>>> 
>>>    cl <- startMPIcluster()
>>> 
>>> Also change the mpirun command in the PBS script to use '-n 32' or
>>> don't specify the -n option at all.  In that case, mpirun will start
>>> all of the workers as well as the master which may work better.
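>>>
>>> For example, the mpirun line in the PBS script would then look something
>>> like:
>>>
>>>    mpirun -n 32 R --slave -f ParallelAnalysis.R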
>>> 
>>> Regards,
>>> 
>>> Steve Weston
>>> 
>>> On Mon, Oct 21, 2013 at 9:42 AM, Srihari Radhakrishnan
>>> <srihari at iastate.edu> wrote:
>>>> Hi,
>>>> 
>>>> I've been trying to use the doMPI to run the iterations of a for loop in
>>>> parallel (using the foreach package) on a cluster. However, I've been
>>>> running into issues - I think it's the way I am running the R script,
>>> but I
>>>> could be wrong. Here's the description of the problem.
>>>> 
>>>> We use a PBS scheduler to submit jobs; my script uses 2 nodes (32 cores)
>>>> for now. I run 1 version of the R interpreter which internally calls 31
>>>> workers using R's mpi libraries. I produce below the PBS script, the R
>>> code
>>>> (the relevant bits) and the error.
>>>> 
>>>> ***Begin PBS Script***
>>>> #!/bin/bash
>>>> 
>>>> #PBS  -o BATCH_OUTPUT
>>>> #PBS  -e BATCH_ERRORS
>>>> 
>>>> #PBS -lnodes=2:ppn=16:compute,walltime=12:00:00
>>>> 
>>>> # Change to directory from which qsub command was issued
>>>>   cd $PBS_O_WORKDIR
>>>> 
>>>> cat $PBS_NODEFILE
>>>> #Call mpirun with 1 copy of the R interpreter. This will spawn 31
>>> workers,
>>>> inside the R script
>>>> time mpirun -n 1 R --slave -f ParallelAnalysis.R
>>>> ***End PBS script***
>>>> 
>>>> ***Begin R Script***
>>>> source("http://bioconductor.org/biocLite.R")
>>>> #MPI stuff initialization
>>>> library(Rmpi)
>>>> library(foreach)
>>>> library(doMPI)
>>>> cl <- startMPIcluster(count=31) #call 31 clusterworkers/slaves
>>>> registerDoMPI(cl)
>>>> library(MEDIPS)
>>>> library(BSgenome)
>>>> .
>>>> .
>>>> *more R code; variable assignments etc; no mpi stuff here*
>>>> .
>>>> .
>>>> #Following code will run 100 parallel iterations using the doMPI library
>>>> loaded above and output results to the variable x. x is a table and
>>> stores
>>>> results from iterations as rows.
>>>> x <-foreach(i=1:100,.combine='rbind') %dopar% {
>>>> *stuff to do inside loop*
>>>> }
>>>> write.table(x, "output.tsv") #write x into file.
>>>> ***End R script***
>>>> 
>>>> The execution halts as soon as the libraries are loaded - I get the
>>>> following error message repeatedly from both nodes (node 203 and node
>>> 202)
>>>> 
>>>> [2:node203] unexpected disconnect completion event from dynamic process
>>>> with rank=0 pg_id=kvs_17890_0 0x1fce600
>>>> Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
>>>> 
>>>> I am not sure if this is an issue with the compilers or the script
>>> itself.
>>>> The script runs successfully without using mpi (using only 1 node). Any
>>>> help would be highly appreciated.
>>>> 
>>>> Thanks in advance,
>>>> Srihari
>>>> 
>>>> --
>>>> Srihari Radhakrishnan
>>>> 
>>>> Ph.D candidate
>>>> Valenzuela Lab
>>>> Iowa State University
>>>> 
>>>> 
>>>> _______________________________________________
>>>> R-sig-hpc mailing list
>>>> R-sig-hpc at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>> 
>>> 
>>> 
>>> ------------------------------
>>> 
>>> Message: 3
>>> Date: Mon, 21 Oct 2013 15:56:59 +0000
>>> From: "Lockwood, Glenn" <glock at sdsc.edu>
>>> To: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
>>> Subject: Re: [R-sig-hpc] Help with doMPI on multiple cores on a
>>>        cluster
>>> Message-ID: <5D6CCDC1-CBB6-4926-8022-B19EC68CDAC5 at sdsc.edu>
>>> Content-Type: text/plain; charset="us-ascii"
>>> 
>>> I'd like to echo Steve's advice--try using Open MPI instead.  I've had
>>> innumerable problems trying to get MPICH derivatives like MVAPICH2 and
>>> Intel MPI to work with Rmpi, and officially, Rmpi only supports Open MPI
>>> and MPICH.  It's an uphill battle.
>>> 
>>> Also, you should be running mpirun with -n 1 (as you are already doing)
>>> if you are calling R directly.  Doing anything else will cause multiple
>>> master scripts to run, each spawning its own set of MPI ranks and leaving
>>> you with far more ranks than you want.  Some libraries provide special
>>> wrappers that let you call them directly with mpirun -np 32 (e.g., snow
>>> provides the RMPISNOW command), but these are unique to each library,
>>> whereas using mpirun -n 1 is universal.
>>> 
>>> Glenn
>>> 
>>> On Oct 21, 2013, at 6:55 AM, Stephen Weston <stephen.b.weston at gmail.com>
>>> wrote:
>>> 
>>>> Hi Srihari,
>>>> 
>>>> I suspect it's an MPI issue.  Are you able to run any other simple MPI
>>>> programs successfully, and specifically, any using R with Rmpi?  From
>>>> the error message, it appears that you're using Intel MPI, which I've
>>>> never used.  I believe Rmpi is primarily tested with Open MPI, which
>>>> is what I've always used with doMPI.  It would be interesting to see
>>>> if you can run successfully using Open MPI, if that is possible for
>>>> you.
>>>> 
>>>> You'll probably need to look for help on an Intel MPI forum, although
>>>> you may need to reduce the problem to something that doesn't use R.
>>>> 
>>>> Here is a similar issue that I found on an Intel MPI forum:
>>>> 
>>>>   http://software.intel.com/en-us/forums/topic/329053
>>>> 
>>>> You could also try running without spawning, since that may be a
>>>> problem for Intel MPI.  To do that, change the R script to use:
>>>> 
>>>>   cl <- startMPIcluster()
>>>> 
>>>> Also change the mpirun command in the PBS script to use '-n 32' or
>>>> don't specify the -n option at all.  In that case, mpirun will start
>>>> all of the workers as well as the master which may work better.
>>>> 
>>>> Regards,
>>>> 
>>>> Steve Weston
>>>> 
>>>> On Mon, Oct 21, 2013 at 9:42 AM, Srihari Radhakrishnan
>>>> <srihari at iastate.edu> wrote:
>>>>> Hi,
>>>>> 
>>>>> I've been trying to use the doMPI to run the iterations of a for loop
>>> in
>>>>> parallel (using the foreach package) on a cluster. However, I've been
>>>>> running into issues - I think it's the way I am running the R script,
>>> but I
>>>>> could be wrong. Here's the description of the problem.
>>>>> 
>>>>> We use a PBS scheduler to submit jobs; my script uses 2 nodes (32
>>> cores)
>>>>> for now. I run 1 version of the R interpreter which internally calls 31
>>>>> workers using R's mpi libraries. I produce below the PBS script, the R
>>> code
>>>>> (the relevant bits) and the error.
>>>>> 
>>>>> ***Begin PBS Script***
>>>>> #!/bin/bash
>>>>> 
>>>>> #PBS  -o BATCH_OUTPUT
>>>>> #PBS  -e BATCH_ERRORS
>>>>> 
>>>>> #PBS -lnodes=2:ppn=16:compute,walltime=12:00:00
>>>>> 
>>>>> # Change to directory from which qsub command was issued
>>>>>  cd $PBS_O_WORKDIR
>>>>> 
>>>>> cat $PBS_NODEFILE
>>>>> #Call mpirun with 1 copy of the R interpreter. This will spawn 31
>>> workers,
>>>>> inside the R script
>>>>> time mpirun -n 1 R --slave -f ParallelAnalysis.R
>>>>> ***End PBS script***
>>>>> 
>>>>> ***Begin R Script***
>>>>> source("http://bioconductor.org/biocLite.R")
>>>>> #MPI stuff initialization
>>>>> library(Rmpi)
>>>>> library(foreach)
>>>>> library(doMPI)
>>>>> cl <- startMPIcluster(count=31) #call 31 clusterworkers/slaves
>>>>> registerDoMPI(cl)
>>>>> library(MEDIPS)
>>>>> library(BSgenome)
>>>>> .
>>>>> .
>>>>> *more R code; variable assignments etc; no mpi stuff here*
>>>>> .
>>>>> .
>>>>> #Following code will run 100 parallel iterations using the doMPI
>>> library
>>>>> loaded above and output results to the variable x. x is a table and
>>> stores
>>>>> results from iterations as rows.
>>>>> x <-foreach(i=1:100,.combine='rbind') %dopar% {
>>>>> *stuff to do inside loop*
>>>>> }
>>>>> write.table(x, "output.tsv") #write x into file.
>>>>> ***End R script***
>>>>> 
>>>>> The execution halts as soon as the libraries are loaded - I get the
>>>>> following error message repeatedly from both nodes (node 203 and node
>>> 202)
>>>>> 
>>>>> [2:node203] unexpected disconnect completion event from dynamic process
>>>>> with rank=0 pg_id=kvs_17890_0 0x1fce600
>>>>> Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
>>>>> 
>>>>> I am not sure if this is an issue with the compilers or the script
>>> itself.
>>>>> The script runs successfully without using mpi (using only 1 node). Any
>>>>> help would be highly appreciated.
>>>>> 
>>>>> Thanks in advance,
>>>>> Srihari
>>>>> 
>>>>> --
>>>>> Srihari Radhakrishnan
>>>>> 
>>>>> Ph.D candidate
>>>>> Valenzuela Lab
>>>>> Iowa State University
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> R-sig-hpc mailing list
>>>>> R-sig-hpc at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>>> 
>>>> _______________________________________________
>>>> R-sig-hpc mailing list
>>>> R-sig-hpc at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>> 
>>> 
>>> 
>>> ------------------------------
>>> 
>>> _______________________________________________
>>> R-sig-hpc mailing list
>>> R-sig-hpc at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>> 
>>> 
>>> End of R-sig-hpc Digest, Vol 61, Issue 4
>>> ****************************************
>>> 
>> 
>> 
>> 
>> --
>> Srihari Radhakrishnan
>> 
>> Ph.D candidate
>> Valenzuela Lab
>> Iowa State University
>> 
>> Website: http://www.public.iastate.edu/~srihari/
>> 
> 
> 
> 
> -- 
> Srihari Radhakrishnan
> 
> Ph.D candidate
> Valenzuela Lab
> Iowa State University
> 
> Website: http://www.public.iastate.edu/~srihari/
> 
> 
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc


