[Rd] SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()
Tomas Kalibera
tom@@@k@||ber@ @end|ng |rom gm@||@com
Wed Mar 27 20:52:21 CET 2019
The problem causing the stray worker processes when the master fails to
open a server socket to listen for connections from workers is not
related to the timeout in socketConnection(), because socketConnection()
will fail right away. It is caused by a bug in checking the setup
timeout (PR 17391).
Fixed in 76275.
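
For illustration only, here is a minimal sketch of a setup-timeout check
in a worker-style connect loop. It is not the code from the fix; the
helper name connect_with_setup_timeout() and its arguments are
hypothetical. The elapsed time is computed with explicit units because
difftime() otherwise picks its units automatically.

```r
# Hypothetical helper (not part of the parallel package): call `connect()`
# repeatedly until it succeeds or `setup_timeout` seconds have elapsed.
connect_with_setup_timeout <- function(connect, setup_timeout = 120,
                                       retry_delay = 0.1) {
  t0 <- Sys.time()
  repeat {
    # Capture a failed attempt as a condition object instead of aborting
    con <- tryCatch(connect(), error = identity)
    if (!inherits(con, "error")) return(con)
    # Force seconds; difftime() would otherwise choose units automatically
    elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
    if (elapsed > setup_timeout)
      stop("failed to connect within the setup timeout")
    Sys.sleep(retry_delay)
  }
}

# Usage: a connection attempt that always fails gives up after ~0.3 seconds
always_fails <- function() stop("connection refused")
result <- tryCatch(
  connect_with_setup_timeout(always_fails, setup_timeout = 0.3,
                             retry_delay = 0.05),
  error = function(e) conditionMessage(e)
)
```

In a real worker, `connect` would wrap the socketConnection() call; here
it is a stand-in so the loop can be tried without opening a socket.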
Best
Tomas
On 3/18/19 2:23 AM, Henrik Bengtsson wrote:
> (Bcc: CRAN)
>
> This is a proposal helping CRAN and alike as well as individual
> developers to avoid stray R processes being left behind that might be
> produced when an example or a package test fails to set up a
> parallel::makeCluster().
>
>
> ISSUE
>
> If a package test sets up a PSOCK cluster and then the master process
> dies for one reason or another, the PSOCK worker processes will
> remain running for 30 days ('timeout') until they time out and
> terminate that way. When this happens on CRAN servers, where many
> packages are checked all the time, this will result in a lot of stray
> R processes.
>
> Here is an example illustrating how R leaves behind stray R processes
> if it fails to establish a connection to one or more background R
> processes launched by 'parallel::makeCluster()'. First, let's make
> sure there are no other R processes running:
>
> $ ps aux | grep -E "exec[/]R"
>
> Then, let's create a PSOCK cluster for which the connection will fail
> (because port 80 is reserved):
>
> $ Rscript -e 'parallel::makeCluster(1L, port=80)'
> Error in socketConnection("localhost", port = port, server = TRUE,
> blocking = TRUE, :
> cannot open the connection
> Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
> In addition: Warning message:
> In socketConnection("localhost", port = port, server = TRUE,
> blocking = TRUE, :
> port 80 cannot be opened
>
> The launched R worker is still running:
>
> $ ps aux | grep -E "exec[/]R"
> hb 20778 37.0 0.4 283092 70624 pts/0 S 17:50 0:00
> /usr/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK()
> --args MASTER=localhost PORT=80 OUT=/dev/null SETUPTIMEOUT=120
> TIMEOUT=2592000 XDR=TRUE
>
> This process will keep running for 'TIMEOUT=2592000' seconds (= 30
> days). The reason for this is that it is currently in the state where
> it attempts to set up a connection to the main R process:
>
> > parallel:::.slaveRSOCK
> function ()
> {
>     makeSOCKmaster <- function(master, port, setup_timeout, timeout,
>         useXDR) {
>         ...
>         repeat {
>             con <- tryCatch({
>                 socketConnection(master, port = port, blocking = TRUE,
>                     open = "a+b", timeout = timeout)
>             }, error = identity)
>             ...
>         }
>
> In other words, it is stuck in 'socketConnection()' and it won't time
> out until 'timeout' seconds.
>
>
> SUGGESTION
>
> To mitigate the problem of stray processes left behind by 'R CMD
> check', we could shorten the 'timeout', which is currently hardcoded to
> 30 days (src/library/parallel/R/snow.R). By making it possible to
> control the default via environment variables, e.g.
>
> setup_timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)), # 2 minutes
> timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 60 * 24 * 30)), # 30 days
>
> it would be straightforward to adjust `R CMD check` to use, say,
>
> R_PARALLEL_SETUP_TIMEOUT=60
>
> by default. This would cause any stray processes to time out after 60
> seconds (instead of 30 days as now).
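
The environment-variable default proposed above could be sketched roughly
as follows. This is only an illustration under assumptions: the helper
timeout_from_env() is hypothetical, and R_PARALLEL_SETUP_TIMEOUT is the
variable name from the proposal, not an existing R setting.

```r
# Hypothetical helper: read a timeout (in seconds) from an environment
# variable, falling back to `default` when the variable is unset or does
# not hold a non-negative number.
timeout_from_env <- function(name, default) {
  raw <- Sys.getenv(name, unset = NA_character_)
  value <- suppressWarnings(as.numeric(raw))  # non-numeric input becomes NA
  if (is.na(value) || value < 0) default else value
}

# Usage: the variable name follows the proposal; 60 * 2 is the fallback
Sys.setenv(R_PARALLEL_SETUP_TIMEOUT = "60")
setup_timeout <- timeout_from_env("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)
Sys.unsetenv("R_PARALLEL_SETUP_TIMEOUT")
```

Validating the value before use avoids passing NA as a timeout when the
variable is set to something that is not a number.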
>
> /Henrik
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel