[R-sig-hpc] Rmpi with PBSPro and OpenMPI

Hao Yu hyu at stats.uwo.ca
Tue Mar 10 18:46:31 CET 2009


Hi Mark,

Which version of Rmpi are you using? Versions 0.5-5 and older had a
bug in Rprofile, which was fixed in 0.5-6.

.Last was never intended as a way to close R slaves. It is only used when
someone doesn't close the R slaves and master properly. Here is what I
normally do:
{karl:58}orterun -n 4 R --no-save -q
master (rank 0, comm 1) of size 4 is running on: karl
slave1 (rank 1, comm 1) of size 4 is running on: karl
slave2 (rank 2, comm 1) of size 4 is running on: karl
slave3 (rank 3, comm 1) of size 4 is running on: karl
> #real codes here ....
> mpi.close.Rslaves()
mpi.close.Rslaves()
[1] 1
> mpi.quit()
mpi.quit()

Please note that master and slaves are created from one communicator. They
live or die together, unlike spawning, where the master can live on even if
the slaves quit.
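
For a non-interactive run, the same close-then-quit sequence could be
wrapped roughly as follows. This is only a sketch: run_master() is a made-up
helper, not part of Rmpi, and it assumes the Rprofile shipped with Rmpi is
in use, so that Rmpi is already loaded and the slaves are waiting for
commands when the master reaches this code.

```r
# run_master() is a hypothetical helper for illustration, not an Rmpi function.
# The two shutdown steps are parameters only so that the ordering can be
# exercised without an MPI launch; the defaults call the real Rmpi functions.
run_master <- function(work,
                       close_slaves = function() mpi.close.Rslaves(),
                       quit = function() mpi.quit()) {
  # Shut down explicitly, slaves first and then the master, even if
  # work() stops with an error, instead of leaving the cleanup to .Last.
  on.exit({
    close_slaves()
    quit()
  }, add = TRUE)
  work()
}

# Same guard as in the quoted .Last: do nothing unless Rmpi is loaded and
# there are slaves on communicator 1.
if (is.loaded("mpi_initialize") && mpi.comm.size(1) > 1) {
  run_master(function() {
    # Placeholder for the real code: ask every slave for its rank.
    print(mpi.remote.exec(mpi.comm.rank()))
  })
}
```

In a plain R session without Rmpi the guard is false and nothing happens.
Under orterun the master does the work, closes the slaves and then quits,
in that order.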

Hao



Lyman, Mark wrote:
> I just recently discovered this list and thought I would ask a question
> about a mildly annoying issue. Generally, our setup works great,
> however, I had to modify the .Last function in the .Rprofile file that
> comes with Rmpi. The function now looks like this:
>         .Last <- function ()
>         {
>                 if (is.loaded("mpi_initialize")) {
>                         if (mpi.comm.size(1) > 1) {
>                                 mpi.bcast.cmd(q("no"))
>                         }
>                 }
>         }
>
> Without this modification, the R code is run successfully, but when
> mpi.quit/mpi.exit/mpi.finalize are run everything stops. It seems that
> the slaves are not being shut down appropriately, and the master never
> gets the signal it is waiting for that the slaves have shut down. Has
> anyone else had this issue and solved it? Or does anyone know what could
> be the cause?
>
> I'm not sure, but I'm afraid that this is related to the following error
> that I occasionally get from OpenMPI:
>
> [n087:30298] [0,0,0] mca_oob_tcp_recv_handler: invalid message type: 0
> [n039:29963] [0,1,65]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv()
> failed with errno=104
> [n087:30298] [0,0,0] mca_oob_tcp_recv_handler: invalid message type: 0
> [n039:29962] [0,1,64]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv()
> failed with errno=104
> [n087:30298] [0,0,0] mca_oob_tcp_recv_handler: invalid message type: 0
> [n039:29964] [0,1,66]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv()
> failed with errno=104
> [n087:30298] [0,0,0] mca_oob_tcp_recv_handler: invalid message type: 0
> [n039:29965] [0,1,67]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv()
> failed with errno=104
>
> Usually, I am able to kill and retry the job and everything works fine,
> but sometimes it can fail repeatedly. Please let me know if any more
> information is needed. As you can see, I am a statistician, and I am
> very new to HPC.
>
> Mark Lyman, Statistician
> Engineering Systems & Integration, ATK
> (435) 863-2863
>
>
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to
> say what the experiment died of.
>
> Sir Ronald Aylmer Fisher
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>


-- 
Department of Statistics & Actuarial Sciences
Fax Phone#:(519)-661-3813
The University of Western Ontario
Office Phone#:(519)-661-3622
London, Ontario N6A 5B7
http://www.stats.uwo.ca/faculty/yu


