[Rd] SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()

Tomas Kalibera tom@@@k@||ber@ @end|ng |rom gm@||@com
Wed Mar 27 20:52:21 CET 2019


The problem causing the stray worker processes when the master fails to 
open a server socket to listen to connections from workers is not 
related to timeout in socketConnection(), because socketConnection() 
will fail right away. It is caused by a bug in checking the setup 
timeout (PR 17391).
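
For illustration, a retry loop that honours a setup timeout might look 
like the sketch below. It is not the code changed in R itself; 
connect_with_deadline() and its arguments are hypothetical names. The 
only point is that the deadline is checked after every failed attempt, 
so the loop cannot keep retrying indefinitely when nobody is listening.

```r
# Hypothetical helper, NOT the code in the parallel package.
# 'connect' is any zero-argument function that returns a connection
# or signals an error (e.g. a wrapper around socketConnection()).
connect_with_deadline <- function(connect, setup_timeout = 120,
                                  retry_delay = 0.1) {
    deadline <- Sys.time() + setup_timeout
    repeat {
        con <- tryCatch(connect(), error = identity)
        if (!inherits(con, "error"))
            return(con)
        # Check the setup deadline on every failed attempt, so that
        # an attempt failing right away cannot cause endless retries.
        if (Sys.time() >= deadline)
            stop("failed to connect within ", setup_timeout,
                 " seconds: ", conditionMessage(con))
        Sys.sleep(retry_delay)
    }
}
```

A worker could call this with something like 
function() socketConnection(master, port = port, blocking = TRUE, 
open = "a+b", timeout = timeout) as the 'connect' argument.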

Fixed in 76275.

Best
Tomas

On 3/18/19 2:23 AM, Henrik Bengtsson wrote:
> (Bcc: CRAN)
>
> This is a proposal helping CRAN and alike as well as individual
> developers to avoid stray R processes being left behind that might be
> produced when an example or a package test fails to set up a
> parallel::makeCluster().
>
>
> ISSUE
>
> If a package test sets up a PSOCK cluster and then the master process
> dies for one reason or another, the PSOCK worker processes will
> remain running for 30 days ('timeout') until they time out and
> terminate that way.  When this happens on CRAN servers, where many
> packages are checked all the time, this will result in a lot of stray
> R processes.
>
> Here is an example illustrating how R leaves behind stray R processes
> if it fails to establish a connection to one or more background R
> processes launched by 'parallel::makeCluster()'.  First, let's make
> sure there are no other R processes running:
>
>    $ ps aux | grep -E "exec[/]R"
>
> Then, let's create a PSOCK cluster for which the connection will fail
> (because port 80 is reserved):
>
>    $ Rscript -e 'parallel::makeCluster(1L, port=80)'
>    Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
>      cannot open the connection
>    Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
>    In addition: Warning message:
>    In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
>      port 80 cannot be opened
>
> The launched R worker is still running:
>
>    $ ps aux | grep -E "exec[/]R"
>    hb       20778 37.0  0.4 283092 70624 pts/0    S    17:50   0:00 /usr/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK() --args MASTER=localhost PORT=80 OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE
>
> This process will keep running for 'TIMEOUT=2592000' seconds (= 30
> days).  The reason for this is that it is currently in the state where
> it attempts to set up a connection to the main R process:
>
>    > parallel:::.slaveRSOCK
>    function ()
>    {
>        makeSOCKmaster <- function(master, port, setup_timeout, timeout,
>            useXDR) {
>     ...
>            repeat {
>                con <- tryCatch({
>                    socketConnection(master, port = port, blocking = TRUE,
>                      open = "a+b", timeout = timeout)
>                }, error = identity)
>        ...
>    }
>
> In other words, it is stuck in 'socketConnection()' and it won't time
> out until 'timeout' seconds.
>
>
> SUGGESTION
>
> To mitigate the problem of stray processes left behind by 'R CMD
> check', we could shorten the 'timeout', which is currently hardcoded to
> 30 days (src/library/parallel/R/snow.R).  By making it possible to
> control the default via environment variables, e.g.
>
>    setup_timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)),      # 2 minutes
>    timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 60 * 24 * 30)), # 30 days
>
> it would be straightforward to adjust `R CMD check` to use, say,
>
>    R_PARALLEL_SETUP_TIMEOUT=60
>
> by default.  This would cause any stray processes to time out after 60
> seconds (instead of 30 days as now).
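
As a rough sketch of the proposal quoted above, a small helper could 
read each timeout from the environment and fall back to the current 
value when the variable is unset or not a valid number. 
timeout_from_env() is a hypothetical helper, and R_PARALLEL_TIMEOUT is 
an assumed name for the second variable; neither exists in the parallel 
package.

```r
# Hypothetical helper: read a timeout (in seconds) from an environment
# variable, falling back to 'default' if it is unset or not a
# non-negative number.
timeout_from_env <- function(name, default) {
    raw <- Sys.getenv(name, unset = NA_character_)
    value <- suppressWarnings(as.numeric(raw))
    if (is.na(value) || value < 0) default else value
}

# Defaults as in the proposal: 2 minutes and 30 days.
# "R_PARALLEL_TIMEOUT" is an assumed variable name.
setup_timeout <- timeout_from_env("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)
timeout <- timeout_from_env("R_PARALLEL_TIMEOUT", 60 * 60 * 24 * 30)
```

Validating the value avoids passing NA on as a timeout when the 
variable is set to something that is not a number.
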
>
> /Henrik
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


