[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

luke-tierney at uiowa.edu
Fri Mar 16 16:56:43 CET 2018


Thanks. Fix committed to R-devel in r74417.

Best,

luke

On Sat, 10 Mar 2018, Henrik Bengtsson wrote:

> Great.
>
> For the record of this thread, I've submitted patch PR17391
> (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17391).  I've
> patched it against the latest R-devel on the SVN, passes 'make
> check-all', and I've verified it works with the above tests.
>
> /Henrik
>
> On Fri, Mar 9, 2018 at 4:37 AM,  <luke-tierney at uiowa.edu> wrote:
>> I'm happy to look at a patch that does this.  I'd start with a small
>> interval and increase it by 50%, say, on each try with a max retry time
>> limit. This isn't eliminating the problem, only reducing the
>> probability, but still worth it. I had considered doing something like
>> this but it didn't seem necessary at the time. You don't want to retry
>> indefinitely since the connection could be failing because the master
>> died, and then you want the workers to die as well.
>>
>> Best,
>>
>> luke
>>
>>
>> On Fri, 9 Mar 2018, Henrik Bengtsson wrote:
>>
>>> A solution is to have parallel:::.slaveRSOCK() attempt to connect
>>> multiple times before failing, e.g.
>>>
>>>    makeSOCKmaster <- function(master, port, timeout, useXDR,
>>>                               maxTries = 10L, interval = 1.0) {
>>>        port <- as.integer(port)
>>>        for (i in seq_len(maxTries)) {
>>>          con <- tryCatch({
>>>            socketConnection(master, port = port, blocking = TRUE,
>>>                                    open = "a+b", timeout = timeout)
>>>          }, error = identity)
>>>          if (inherits(con, "connection")) break
>>>          Sys.sleep(interval)
>>>        }
>>>        if (inherits(con, "error")) stop(con)
>>>        structure(list(con = con), class = if (useXDR)
>>>            "SOCKnode"
>>>        else "SOCK0node")
>>>    }
>>>
>>> One could set 'maxTries' and 'interval' via commandArgs() like what is
>>> done for the other arguments.
>>>
>>> I'm happy to submit an SVN patch if R core thinks this is an
>>> acceptable solution.
>>>
>>> /Henrik
>>>
>>> On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
>>> <henrik.bengtsson at gmail.com> wrote:
>>>>
>>>> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
>>>> socketConnection() as a character, i.e. there's a missing timeout <-
>>>> as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
>>>> as.logical(value):
>>>>
>>>>> parallel:::.slaveRSOCK
>>>>
>>>> function ()
>>>> {
>>>>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>>>>         port <- as.integer(port)
>>>>         con <- socketConnection(master, port = port, blocking = TRUE,
>>>>             open = "a+b", timeout = timeout)
>>>>         structure(list(con = con), class = if (useXDR)
>>>>             "SOCKnode"
>>>>         else "SOCK0node")
>>>>     }
>>>>     master <- "localhost"
>>>>     port <- NA_integer_
>>>>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>>>>     methods <- TRUE
>>>>     useXDR <- TRUE
>>>>     for (a in commandArgs(TRUE)) {
>>>>         pos <- regexpr("=", a)
>>>>         name <- substr(a, 1L, pos - 1L)
>>>>         value <- substr(a, pos + 1L, nchar(a))
>>>>         switch(name, MASTER = {
>>>>             master <- value
>>>>         }, PORT = {
>>>>             port <- value
>>>>         }, OUT = {
>>>>             outfile <- value
>>>>         }, TIMEOUT = {
>>>>             timeout <- value
>>>>         }, XDR = {
>>>>             useXDR <- as.logical(value)
>>>>         })
>>>>     }
>>>>     if (is.na(port))
>>>>         stop("PORT must be specified")
>>>>     sinkWorkerOutput(outfile)
>>>>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>>>>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>>>>     cat(msg)
>>>>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
>>>> }
>>>> <bytecode: 0x4bd4b58>
>>>> <environment: namespace:parallel>
>>>>
>>>> Yet, fixing that does *not* seem to change anything.
>>>>
>>>> /Henrik
>>>>
>>>> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
>>>> <henrik.bengtsson at gmail.com> wrote:
>>>>>
>>>>> BACKGROUND:
>>>>>
>>>>> While troubleshooting random, occasionally occurring, errors from
>>>>> parallel::makePSOCKcluster("localhost", port = 11000);
>>>>>
>>>>> Error in socketConnection("localhost", port = port, server = TRUE,
>>>>> blocking = TRUE, :
>>>>>     cannot open the connection
>>>>>
>>>>> I had another look at parallel:::newPSOCKnode(), which is used
>>>>> internally to set up each background worker.  It is designed to first
>>>>> launch the background worker as:
>>>>>
>>>>>   system(paste('R --slave --no-restore -e "parallel:::.slaveRSOCK()"',
>>>>>     '--args MASTER=localhost PORT=11000 OUT=/dev/null',
>>>>>     'TIMEOUT=2592000 XDR=TRUE'), wait = FALSE)
>>>>>
>>>>> which immediately tries to connect to a socket on localhost:11000 with
>>>>> the given timeout.  Immediately after launching the above (without
>>>>> waiting), the master sets up the connection and waits for the
>>>>> background worker to connect:
>>>>>
>>>>>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>>>>>         blocking = TRUE, open = "a+b", timeout = timeout)
>>>>>
>>>>>
>>>>> ISSUE:
>>>>>
>>>>> If we emulate the above process, and remove the OUT=/dev/null such
>>>>> that we can see the output produced by the worker, as:
>>>>>
>>>>> setup <- function(delay = 0) {
>>>>>   system(paste('R --slave --no-restore -e "parallel:::.slaveRSOCK()"',
>>>>>     '--args MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE'),
>>>>>     wait = FALSE)
>>>>>   Sys.sleep(delay)
>>>>>   socketConnection("localhost", port = 11000, server = TRUE,
>>>>>     blocking = TRUE, open = "a+b", timeout = 20)
>>>>> }
>>>>>
>>>>> doing:
>>>>>
>>>>>> con <- setup(0)
>>>>>
>>>>> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>>>>
>>>>> will most likely work, but adding a delay:
>>>>>
>>>>>> con <- setup(5)
>>>>>
>>>>> starting worker pid=25099 on localhost:11000 at 18:45:23.617
>>>>> Warning in socketConnection(master, port = port, blocking = TRUE, open
>>>>> = "a+b",  :
>>>>>   localhost:11000 cannot be opened
>>>>> Error in socketConnection(master, port = port, blocking = TRUE, open =
>>>>> "a+b",  :
>>>>>   cannot open the connection
>>>>> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
>>>>> socketConnection
>>>>>
>>>>> will produce an *instant* error on the worker, and before master opens
>>>>> the server socket.  Eventually, master will produce the originally
>>>>> observed error:
>>>>>
>>>>> Error in socketConnection("localhost", port = 11000, server = TRUE,
>>>>> blocking = TRUE,  :
>>>>>   cannot open the connection
>>>>>
>>>>> In other words, if the master fails to set up socketConnection()
>>>>> *before* the background worker attempts to connect, it all fails.
>>>>> Such a delay may happen for instance when there is a large CPU load on
>>>>> the test machine.
>>>>>
>>>>> Is this the above bug?
>>>>>
>>>>> /Henrik
>>>>>
>>>>> PS. The background is that I, very occasionally, observe R CMD check
>>>>> error on the above (on CRAN and elsewhere) when testing my future
>>>>> package. The error always goes away when retested. Thus far I've thought
>>>>> this is due to port clashes (since the port is randomly selected in
>>>>> [11000:11999]) and accepted that it happens.  However, after
>>>>> discovering the above, it could be due to the worker launching "too
>>>>> soon".
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 3.4.3 (2017-11-30)
>>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>> Running under: Ubuntu 16.04.4 LTS
>>>>>
>>>>> Matrix products: default
>>>>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>>>>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>>>>
>>>>> locale:
>>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] compiler_3.4.3
>>>
>>>
>>>
>>
>> --
>> Luke Tierney
>> Ralph E. Wareham Professor of Mathematical Sciences
>> University of Iowa                  Phone:             319-335-3386
>> Department of Statistics and        Fax:               319-335-3017
>>    Actuarial Science
>> 241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>
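
For reference, the retry scheme discussed in the quoted thread (start with
a small interval, grow it by 50% per attempt, give up after a maximum total
retry time) could be sketched roughly as below. This is only an
illustration and not the code committed in r74417: the helper name
retry_connect() and its arguments 'interval', 'growth' and 'max_time' are
hypothetical, as are the default values. The as.integer(timeout) coercion
follows the observation made earlier in the thread.

```r
# Hypothetical helper (not part of the parallel package): call `connect()`
# until it succeeds, sleeping between failed attempts. The sleep interval
# starts small and grows by 50% after each failure. Once the next sleep
# would take the total time past `max_time` seconds, the last error is
# re-signalled, so a worker whose master has died does not retry forever.
retry_connect <- function(connect, interval = 0.1, growth = 1.5,
                          max_time = 30) {
  start <- proc.time()[["elapsed"]]
  repeat {
    res <- tryCatch(connect(), error = identity)
    if (!inherits(res, "error")) return(res)
    elapsed <- proc.time()[["elapsed"]] - start
    if (elapsed + interval > max_time) stop(res)
    Sys.sleep(interval)
    interval <- interval * growth
  }
}

# Sketch of how the worker side might use the helper; argument names are
# the ones used by makeSOCKmaster() as shown in the thread.
makeSOCKmaster <- function(master, port, timeout, useXDR) {
  port <- as.integer(port)
  timeout <- as.integer(timeout)
  con <- retry_connect(function() {
    socketConnection(master, port = port, blocking = TRUE,
                     open = "a+b", timeout = timeout)
  })
  structure(list(con = con),
            class = if (useXDR) "SOCKnode" else "SOCK0node")
}
```

As noted in the thread, this only reduces the probability of the race
rather than eliminating it; the cap on total retry time is what lets a
worker exit if the master never opens its socket.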

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu