[Rd] SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()
Henrik Bengtsson
henr|k@bengt@@on @end|ng |rom gm@||@com
Mon Mar 18 02:23:56 CET 2019
(Bcc: CRAN)
This is a proposal to help CRAN and the like, as well as individual
developers, avoid stray R processes being left behind when an example
or a package test fails to set up a cluster with
parallel::makeCluster().
ISSUE
If a package test sets up a PSOCK cluster and then the master process
dies for one reason or another, the PSOCK worker processes will
remain running for 30 days ('timeout') until they time out and
terminate that way. When this happens on CRAN servers, where many
packages are checked all the time, this will result in a lot of stray
R processes.
Here is an example illustrating how R leaves behind stray R processes
if it fails to establish a connection to one or more background R
processes launched by 'parallel::makeCluster()'. First, let's make
sure there are no other R processes running:
$ ps aux | grep -E "exec[/]R"
Then, let's create a PSOCK cluster for which the connection will fail
(because port 80 is reserved):
$ Rscript -e 'parallel::makeCluster(1L, port=80)'
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
  cannot open the connection
Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
  port 80 cannot be opened
The launched R worker is still running:
$ ps aux | grep -E "exec[/]R"
hb 20778 37.0 0.4 283092 70624 pts/0 S 17:50 0:00
/usr/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK()
--args MASTER=localhost PORT=80 OUT=/dev/null SETUPTIMEOUT=120
TIMEOUT=2592000 XDR=TRUE
This process will keep running for 'TIMEOUT=2592000' seconds (= 30
days). The reason for this is that it is currently in the state where
it attempts to set up a connection to the main R process:
> parallel:::.slaveRSOCK
function ()
{
    makeSOCKmaster <- function(master, port, setup_timeout, timeout,
        useXDR) {
        ...
        repeat {
            con <- tryCatch({
                socketConnection(master, port = port, blocking = TRUE,
                    open = "a+b", timeout = timeout)
            }, error = identity)
            ...
        }
In other words, it is stuck in 'socketConnection()' and it won't time
out until 'timeout' seconds have passed.
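As an aside, an individual package test can already ask for a shorter
timeout, because 'timeout' is one of the options that makeCluster()
passes on to makePSOCKcluster(). Below is a minimal sketch of that. The
wrapper name 'make_test_cluster' and the 60-second value are only
illustrative assumptions, and the same 'timeout' also limits how long
an idle worker waits for the master, so too small a value may end
legitimate workers early.

```r
# Hypothetical wrapper (not part of the 'parallel' package): create a PSOCK
# cluster whose workers are launched with a short TIMEOUT=<seconds> instead
# of the 30-day default.
make_test_cluster <- function(n = 1L, timeout = 60, ...) {
  parallel::makeCluster(n, timeout = timeout, ...)
}

# Usage in a package test (not run here, since it launches R processes):
# cl <- make_test_cluster(2L)
# on.exit(parallel::stopCluster(cl), add = TRUE)
```

This only helps packages that opt in, which is why a default that 'R
CMD check' can control is still of interest.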
SUGGESTION
To mitigate the problem of stray processes like the above being left
behind by 'R CMD check', we could shorten the 'timeout', which is
currently hardcoded to 30 days (src/library/parallel/R/snow.R). By
making it possible to control the defaults via environment variables,
e.g.
setup_timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)),  # 2 minutes
timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 60 * 24 * 30)),  # 30 days
it would be straightforward to adjust `R CMD check` to use, say,
R_PARALLEL_SETUP_TIMEOUT=60
by default. This would cause any stray processes to time out after 60
seconds (instead of 30 days as now).
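To illustrate how such a default could be looked up, here is a rough
sketch. The helper 'timeout_from_env' is hypothetical and not something
that exists in 'parallel'. It falls back to the given default when the
variable is unset, empty, or not a positive number, which is one
possible way to handle a misconfigured environment.

```r
# Hypothetical helper (an assumption for illustration only): read a timeout
# in seconds from an environment variable, falling back to 'default'.
timeout_from_env <- function(name, default) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) return(default)
  seconds <- suppressWarnings(as.numeric(value))
  if (is.na(seconds) || seconds <= 0) {
    warning("Ignoring invalid value of ", name, ": ", sQuote(value))
    return(default)
  }
  seconds
}

# Today's hardcoded defaults, in seconds
default_setup_timeout <- 60 * 2             # 2 minutes
default_timeout       <- 60 * 60 * 24 * 30  # 30 days

# With the variable set, the default is overridden ...
Sys.setenv(R_PARALLEL_SETUP_TIMEOUT = "60")
timeout_from_env("R_PARALLEL_SETUP_TIMEOUT", default_timeout)

# ... and without it, the current 30-day default is kept
Sys.unsetenv("R_PARALLEL_SETUP_TIMEOUT")
timeout_from_env("R_PARALLEL_SETUP_TIMEOUT", default_timeout)
```

Keeping the current values as fallbacks means nothing changes for
anyone who does not set the variable.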
/Henrik