[Rd] SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()
Henrik Bengtsson
henr|k@bengt@@on @end|ng |rom gm@||@com
Thu Mar 28 05:20:36 CET 2019
Thank you Tomas.
For the record, I'm confirming that the stray background R worker
process now times out properly after 'setup_timeout' (= 120) seconds:
{0s}$ Rscript -e 'parallel::makeCluster(1L, port=80)'
Error in socketConnection("localhost", port = port, server = TRUE,
blocking = TRUE, :
cannot open the connection
Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
port 80 cannot be opened
Execution halted
{1s}$ ps aux | grep -E "exec[/]R"
hb 17645 2.0 0.3 259104 55144 pts/5 S 20:58 0:00
/home/hb/software/R-devel/trunk/lib/R/bin/exec/R --slave --no-restore
-e parallel:::.slaveRSOCK() --args MASTER=localhost PORT=80
OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE
{2s}$ sleep 120
{122s}$ ps aux | grep -E "exec[/]R"
{122s}$
Good spotting of the bug:
- if (Sys.time() - t0 > setup_timeout) break
+ if (difftime(Sys.time(), t0, units="secs") > setup_timeout) break
For those who find this thread, I think what's going on here is that
'setup_timeout = 120' is a plain numeric that is compared to a
'difftime' that keeps changing units as time goes by. When compared as
'Sys.time() - t0 > setup_timeout', the LHS is in units of seconds as
long as less than 60 seconds have passed:
> Sys.time() - t0
Time difference of 59 secs
> as.numeric(Sys.time() - t0)
[1] 59
However, as soon as more than 60 seconds have passed, the unit turns
into minutes and we're comparing minutes to seconds:
> Sys.time() - t0
Time difference of 1.016667 mins
> as.numeric(Sys.time() - t0)
[1] 1.016667
which is now compared to 'setup_timeout'. If the unit stayed in
minutes, it would time out after 120 minutes. However, the unit of
Sys.time() - t0 switches to hours after 60 minutes, so we end up
comparing hours to seconds, and so on. It would only time out as
intended if we used 'setup_timeout' < 60 seconds.
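
To make the unit switching above easy to reproduce without waiting, here
is a minimal sketch that uses fixed timestamps instead of Sys.time().
The helper names timed_out_buggy() and timed_out_fixed() are
hypothetical and only for illustration; they are not part of
'parallel'. They mimic the two versions of the comparison in the diff
above.

```r
# Fixed start time so the result does not depend on the wall clock.
t0 <- as.POSIXct("2019-03-28 00:00:00", tz = "UTC")
setup_timeout <- 120  # seconds, as in SETUPTIMEOUT=120

# Hypothetical helpers (illustration only, not part of 'parallel').
# Old check: the difftime picks its own units and is compared by value.
timed_out_buggy <- function(now) (now - t0) > setup_timeout
# New check: the units are pinned to seconds before comparing.
timed_out_fixed <- function(now) {
  difftime(now, t0, units = "secs") > setup_timeout
}

elapsed <- c(59, 61, 121, 3600, 7201)  # seconds since t0
res <- data.frame(
  elapsed = elapsed,
  # Units that 'now - t0' chooses automatically at each elapsed time
  units = vapply(elapsed, function(s) units((t0 + s) - t0), character(1)),
  buggy = vapply(elapsed, function(s) timed_out_buggy(t0 + s), logical(1)),
  fixed = vapply(elapsed, function(s) timed_out_fixed(t0 + s), logical(1)),
  stringsAsFactors = FALSE
)
print(res)
```

With these inputs the 'buggy' column stays FALSE for every row, whereas
the 'fixed' column turns TRUE from 121 seconds onward.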
/Henrik
On Wed, Mar 27, 2019 at 12:52 PM Tomas Kalibera
<tomas.kalibera using gmail.com> wrote:
>
>
> The problem causing the stray worker processes when the master fails to
> open a server socket to listen to connections from workers is not
> related to timeout in socketConnection(), because socketConnection()
> will fail right away. It is caused by a bug in checking the setup
> timeout (PR 17391).
>
> Fixed in 76275.
>
> Best
> Tomas
>
> On 3/18/19 2:23 AM, Henrik Bengtsson wrote:
> > (Bcc: CRAN)
> >
> > This is a proposal helping CRAN and alike as well as individual
> > developers to avoid stray R processes being left behind that might be
> > produced when an example or a package test fails to set up a
> > parallel::makeCluster().
> >
> >
> > ISSUE
> >
> > If a package test sets up a PSOCK cluster and then the master process
> > dies for one reason or the other, the PSOCK worker processes will
> > remain running for 30 days ('timeout') until they time out and
> > terminate that way. When this happens on CRAN servers, where many
> > packages are checked all the time, this will result in a lot of stray
> > R processes.
> >
> > Here is an example illustrating how R leaves behind stray R processes
> > if it fails to establish a connection to one or more background R
> > processes launched by 'parallel::makeCluster()'. First, let's make
> > sure there are no other R processes running:
> >
> > $ ps aux | grep -E "exec[/]R"
> >
> > Then, let's create a PSOCK cluster for which connection will fail
> > (because port 80 is reserved):
> >
> > $ Rscript -e 'parallel::makeCluster(1L, port=80)'
> > Error in socketConnection("localhost", port = port, server = TRUE,
> > blocking = TRUE, :
> > cannot open the connection
> > Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
> > In addition: Warning message:
> > In socketConnection("localhost", port = port, server = TRUE,
> > blocking = TRUE, :
> > port 80 cannot be opened
> >
> > The launched R worker is still running:
> >
> > $ ps aux | grep -E "exec[/]R"
> > hb 20778 37.0 0.4 283092 70624 pts/0 S 17:50 0:00
> > /usr/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK()
> > --args MASTER=localhost PORT=80 OUT=/dev/null SETUPTIMEOUT=120
> > TIMEOUT=2592000 XDR=TRUE
> >
> > This process will keep running for 'TIMEOUT=2592000' seconds (= 30
> > days). The reason for this is that it is currently in the state where
> > it attempts to set up a connection to the main R process:
> >
> > > parallel:::.slaveRSOCK
> > function ()
> > {
> > makeSOCKmaster <- function(master, port, setup_timeout, timeout,
> > useXDR) {
> > ...
> > repeat {
> > con <- tryCatch({
> > socketConnection(master, port = port, blocking = TRUE,
> > open = "a+b", timeout = timeout)
> > }, error = identity)
> > ...
> > }
> >
> > In other words, it is stuck in 'socketConnection()' and it won't time
> > out until 'timeout' seconds.
> >
> >
> > SUGGESTION
> >
> > To mitigate the problem with the above stray processes from running 'R CMD
> > check', we could shorten the 'timeout' which is currently hardcoded to
> > 30 days (src/library/parallel/R/snow.R). By making it possible to
> > control the default via environment variables, e.g.
> >
> > setup_timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)), # 2 minutes
> > timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 60 * 24 * 30)), # 30 days
> >
> > it would be straightforward to adjust `R CMD check` to use, say,
> >
> > R_PARALLEL_SETUP_TIMEOUT=60
> >
> > by default. This would cause any stray processes to time out after 60
> > seconds (instead of 30 days as now).
> >
> > /Henrik
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>