[R-sig-hpc] stopCluster hangs instead of exits
Bennet Fauber
bennet @end|ng |rom um|ch@edu
Sat Nov 16 17:59:44 CET 2019
We have a newish installation and are having some issues with
stopCluster() hanging when the cluster object is created using
cl <- makeMPIcluster(5)
from snow.
The base R is 3.6.1. The version of Rmpi is 0.6-9. The version of
OpenMPI against which Rmpi was installed is 3.1.4.
The makeMPIcluster() call seems to work, and processes are created. They
look like this, for example:
bennet 26330 16163 0 11:07 pts/15 00:00:00 mpirun -np 1 Rmpi --no-restore --no-save
bennet 26369 26330 99 11:07 pts/15 00:00:23 /sw/arcts/centos7/stacks/gcc/8.2.0/R/3.6.1/lib64/R/bin/exec/R --slave --no-restore --file=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1/snow/RMPInode.R --args SNOWLIB=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1 OUT=/dev/null
bennet 26370 26330 99 11:07 pts/15 00:00:23 /sw/arcts/centos7/stacks/gcc/8.2.0/R/3.6.1/lib64/R/bin/exec/R --slave --no-restore --file=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1/snow/RMPInode.R --args SNOWLIB=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1 OUT=/dev/null
bennet 26371 26330 99 11:07 pts/15 00:00:23 /sw/arcts/centos7/stacks/gcc/8.2.0/R/3.6.1/lib64/R/bin/exec/R --slave --no-restore --file=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1/snow/RMPInode.R --args SNOWLIB=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1 OUT=/dev/null
bennet 26372 26330 99 11:07 pts/15 00:00:23 /sw/arcts/centos7/stacks/gcc/8.2.0/R/3.6.1/lib64/R/bin/exec/R --slave --no-restore --file=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1/snow/RMPInode.R --args SNOWLIB=/sw/arcts/centos7/stacks/gcc/8.2.0/Rmpi/0.6-9/R-3.6.1 OUT=/dev/null
They seem able to do work and communicate OK. The only issue comes
when stopCluster(cl) is called: R hangs until it is interrupted with
Ctrl-C, and then it exits entirely.
The test program simply gathers the host name from each slave.
> library(Rmpi)
> library(parallel)
> library(snow)

Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

>
> cl <- makeCluster(4)
4 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()['nodename'])
[[1]]
nodename
"gl-build.arc-ts.umich.edu"
[[2]]
nodename
"gl-build.arc-ts.umich.edu"
[[3]]
nodename
"gl-build.arc-ts.umich.edu"
[[4]]
nodename
"gl-build.arc-ts.umich.edu"
> stopCluster(cl)
at which point intervention is required.
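
Since the attach message above shows snow masking stopCluster from parallel, one thing that may help narrow this down is confirming which package's stopCluster() is actually reached and which S3 classes the cluster object carries. The following is only a minimal sketch of such a check; the helper name describe_stop_dispatch is hypothetical (it is not part of snow, parallel, or Rmpi), and it diagnoses dispatch only, not the hang itself.

```r
library(parallel)  # ships with base R; snow may additionally be attached

# Hypothetical helper (not part of any package): report which package's
# stopCluster() is first on the search path and the classes of the cluster
# object, since the classes decide which S3 method gets dispatched.
describe_stop_dispatch <- function(cl) {
  generic <- get("stopCluster", mode = "function")
  list(
    generic_from = environmentName(environment(generic)),
    classes      = class(cl)
  )
}
```

In the session above it would be called as describe_stop_dispatch(cl) just before stopCluster(cl). If snow was attached after parallel, generic_from would be expected to read "snow".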
Any thoughts on what might be wrong and how I should go about fixing it?
Let me know if you need additional information, please.
Thank you, -- bennet