[R-sig-hpc] stopCluster hangs instead of exits

Bennet Fauber bennet @end|ng |rom um|ch@edu
Sat Nov 16 17:59:44 CET 2019


We have a newish installation and are having some issues with
stopCluster() hanging when the cluster object is created using

    cl <- makeMPIcluster(5)

from snow.

The base R is 3.6.1.  The version of Rmpi is 0.6-9.  The version of
OpenMPI against which Rmpi was installed is 3.1.4.

The makeMPIcluster() call seems to work, and the worker processes are
created.  They look like this, for example:

    bennet    26330  16163  0 11:07 pts/15   00:00:00 mpirun -np 1 Rmpi --no-restore --no-save

    bennet    26369  26330 99 11:07 pts/15   00:00:23 /sw/arcts/centos7/stacks/gcc/8.2.0/R/3.6.1/lib64/R/bin/exec/R --slave --no-restore --file=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1/snow/RMPInode.R --args SNOWLIB=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1 OUT=/dev/null

    bennet    26370  26330 99 11:07 pts/15   00:00:23 /sw/arcts/centos7/stacks/gcc/8.2.0/R/3.6.1/lib64/R/bin/exec/R --slave --no-restore --file=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1/snow/RMPInode.R --args SNOWLIB=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1 OUT=/dev/null

    bennet    26371  26330 99 11:07 pts/15   00:00:23 /sw/arcts/centos7/stacks/gcc/8.2.0/R/3.6.1/lib64/R/bin/exec/R --slave --no-restore --file=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1/snow/RMPInode.R --args SNOWLIB=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1 OUT=/dev/null

    bennet    26372  26330 99 11:07 pts/15   00:00:23 /sw/arcts/centos7/stacks/gcc/8.2.0/R/3.6.1/lib64/R/bin/exec/R --slave --no-restore --file=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1/snow/RMPInode.R --args SNOWLIB=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1 OUT=/dev/null

They seem able to do work and communicate OK.  The only issue comes
when stopCluster(cl) is called: R hangs until it is interrupted with
Ctrl-C, and then it exits entirely.

The test program simply gathers the host name from each slave.

> library(Rmpi)
> library(parallel)
> library(snow)

Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

>
> cl <- makeCluster(4)
    4 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()['nodename'])
[[1]]
                   nodename
"gl-build.arc-ts.umich.edu"

[[2]]
                   nodename
"gl-build.arc-ts.umich.edu"

[[3]]
                   nodename
"gl-build.arc-ts.umich.edu"

[[4]]
                   nodename
"gl-build.arc-ts.umich.edu"

> stopCluster(cl)

at which point intervention is required.
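
For anyone who wants to try this non-interactively, the session above
could be wrapped up roughly as follows.  This is only a sketch and not
the exact program we ran: the function name run_repro and its
n_workers argument are made up for illustration, and type = "MPI" is
spelled out on the assumption that it matches what makeCluster(4)
selects by default on our installation.

```r
# Hypothetical reproducer for the stopCluster() hang; the name run_repro
# and the n_workers argument are illustrative and not from our test program.
run_repro <- function(n_workers = 4) {
  # Same load order as in the interactive session above.
  library(Rmpi)
  library(snow)

  # Assumption: type = "MPI" matches what makeCluster(4) picked by default.
  cl <- makeCluster(n_workers, type = "MPI")

  # Gather the host name from each worker, as in the session above.
  hosts <- clusterCall(cl, function() Sys.info()[["nodename"]])
  print(unlist(hosts))

  # This is the call that hangs for us; the second message never appears.
  message("calling stopCluster() ...")
  stopCluster(cl)
  message("stopCluster() returned")

  invisible(hosts)
}

# Usage (requires a working Rmpi/snow/MPI installation):
# run_repro(4)
```

The two message() calls are there only to show which side of
stopCluster() the session stalls on.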

Any thoughts on what might be wrong and how I should go about fixing it?

Let me know if you need additional information, please.

Thank you,    -- bennet


