[R-sig-hpc] stopCluster hangs instead of exits

Bennet Fauber bennet @end|ng |rom um|ch@edu
Sat Nov 16 18:18:16 CET 2019


I have a small test program that uses only Rmpi functions and performs
a similar task, and it runs cleanly and to completion.

Rmpi-test.R
--------------------------------------------------------
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
    library("Rmpi")
}
# Spawn N-1 workers
mpi.spawn.Rslaves(nslaves=mpi.universe.size()-1)

# The command we want to run on all the nodes/processors we have
mpi.remote.exec(paste("I am ", mpi.comm.rank(), " of ",
                      mpi.comm.size(), " on ",
                      Sys.info()[c("nodename")]))

# Stop the worker processes
mpi.close.Rslaves()

# Close down the MPI processes and quit R
mpi.quit()
--------------------------------------------------------

The MPI installation itself is a cluster installation, and many other
applications are using the MPI successfully, so I am pretty sure that
MPI is working.

The issue, then, seems to be some interaction between snow's stopCluster() and something else, but what?
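For comparison, the snow-over-Rmpi version that exhibits the hang would look roughly like this (a sketch only; the worker count and the clusterCall payload are my assumptions, not the original failing program):

```r
# Sketch of the snow equivalent: spawn MPI workers via Rmpi,
# run a trivial call on each, then shut the cluster down.
library(Rmpi)
library(snow)

cl <- makeMPIcluster(3)     # spawn 3 MPI workers through Rmpi

# Ask each worker for its hostname
print(clusterCall(cl, function() Sys.info()[["nodename"]]))

stopCluster(cl)             # this is the call that hangs
mpi.quit()
```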

Output from the above commands

> # Load the R MPI package if it is not already loaded.
> if (!is.loaded("mpi_initialize")) {
+     library("Rmpi")
+ }
> # Spawn N-1 workers
> paste(" There are ", mpi.universe.size(), " ranks in this universe")
[1] " There are  36  ranks in this universe"
> mpi.spawn.Rslaves(nslaves=3)
    3 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 4 is running on: gl-build
slave1 (rank 1, comm 1) of size 4 is running on: gl-build
slave2 (rank 2, comm 1) of size 4 is running on: gl-build
slave3 (rank 3, comm 1) of size 4 is running on: gl-build
>
> # The command we want to run on all the nodes/processors we have
> mpi.remote.exec(paste("I am ", mpi.comm.rank(), " of ",
+                        mpi.comm.size(), " on ",
+                        Sys.info()
+                        [c("nodename")]
+                      )
+                )
$slave1
[1] "I am  1  of  4  on  gl-build.arc-ts.umich.edu"

$slave2
[1] "I am  2  of  4  on  gl-build.arc-ts.umich.edu"

$slave3
[1] "I am  3  of  4  on  gl-build.arc-ts.umich.edu"

>
> # Stop the worker processes
> mpi.close.Rslaves()
[1] 1
>
> # Close down the MPI processes and quit R
> mpi.quit()

On Sat, Nov 16, 2019 at 12:11 PM Dirk Eddelbuettel <edd using debian.org> wrote:
>
>
> On 16 November 2019 at 11:59, Bennet Fauber wrote:
> | Any thoughts on what might be wrong and how I should go about fixing it?
>
> I would think that is an OpenMPI issue.
>
> My inclination would be to try to replicate it with a pure C/C++ "hello MPI
> world" and check whether it returns cleanly or not when launched from
> `orterun` (or alike) with similar options.
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | edd using debian.org
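
Such a pure-MPI C test, as suggested above, might look like the following (a sketch of the kind of program described; the compile and launch commands are assumptions about the local setup):

```c
/* Minimal MPI "hello world" to check whether MPI_Finalize() returns
 * cleanly outside of R, e.g.:
 *   mpicc hello.c -o hello && orterun -n 4 ./hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am %d of %d\n", rank, size);
    MPI_Finalize();   /* if this hangs too, the problem is below R */
    return 0;
}
```

If this program exits cleanly under the same launcher and options, that would point the finger back at the R layer (snow/Rmpi) rather than the MPI installation.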
