[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Henrik Bengtsson henrik.bengtsson at gmail.com
Sat Mar 10 02:52:16 CET 2018


Great.

For the record of this thread, I've submitted patch PR17391
(https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17391).  I've
patched it against the latest R-devel on the SVN, passes 'make
check-all', and I've verified it works with the above tests.

/Henrik

On Fri, Mar 9, 2018 at 4:37 AM,  <luke-tierney at uiowa.edu> wrote:
> I'm happy to look at a patch that does this.  I'd start with a small
> interval and increase it by 50%, say, on each try wit a max retry time
> limit. This isn't eliminating the problem,only reducing the
> probability, but still worth it. I had considered doing something like
> this but it didn't seem necessary at the time. You don't want to retry
> indefinitely since the connection could be failing because the master
> died, and then you want the workers to die as well.
>
> Best,
>
> luke
>
>
> On Fri, 9 Mar 2018, Henrik Bengtsson wrote:
>
>> A solution is to have parallel:::.slaveRSOCK() attempt to connect
>> multiple times before failing, e.g.
>>
>>    makeSOCKmaster <- function(master, port, timeout, useXDR, maxTries
>> = 10L, interval = 1.0) {
>>        port <- as.integer(port)
>>        for (i in seq_len(maxTries)) {
>>          con <- tryCatch({
>>            socketConnection(master, port = port, blocking = TRUE,
>>                                    open = "a+b", timeout = timeout)
>>          }, error = identity)
>>          if (inherits(con, "connection")) break
>>          Sys.sleep(interval)
>>        }
>>        if (inherits(con, "error")) stop(con)
>>        structure(list(con = con), class = if (useXDR)
>>            "SOCKnode"
>>        else "SOCK0node")
>>    }
>>
>> One could set 'maxTries' and 'interval' via commandArgs() like what is
>> done for the other arguments.
>>
>> I'm happy to submit an SVN patch if R core thinks this is an
>> acceptable solution.
>>
>> /Henrik
>>
>> On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
>> <henrik.bengtsson at gmail.com> wrote:
>>>
>>> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
>>> socketConnection() as a character, i.e. there's a missing timeout <-
>>> as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
>>> as.logical(value):
>>>
>>>> parallel:::.slaveRSOCK
>>>
>>> function ()
>>> {
>>>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>>>         port <- as.integer(port)
>>>         con <- socketConnection(master, port = port, blocking = TRUE,
>>>             open = "a+b", timeout = timeout)
>>>         structure(list(con = con), class = if (useXDR)
>>>             "SOCKnode"
>>>         else "SOCK0node")
>>>     }
>>>     master <- "localhost"
>>>     port <- NA_integer_
>>>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>>>     methods <- TRUE
>>>     useXDR <- TRUE
>>>     for (a in commandArgs(TRUE)) {
>>>         pos <- regexpr("=", a)
>>>         name <- substr(a, 1L, pos - 1L)
>>>         value <- substr(a, pos + 1L, nchar(a))
>>>         switch(name, MASTER = {
>>>             master <- value
>>>         }, PORT = {
>>>             port <- value
>>>         }, OUT = {
>>>             outfile <- value
>>>         }, TIMEOUT = {
>>>             timeout <- value
>>>         }, XDR = {
>>>             useXDR <- as.logical(value)
>>>         })
>>>     }
>>>     if (is.na(port))
>>>         stop("PORT must be specified")
>>>     sinkWorkerOutput(outfile)
>>>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>>>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>>>     cat(msg)
>>>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
>>> }
>>> <bytecode: 0x4bd4b58>
>>> <environment: namespace:parallel>
>>>
>>> Yet, fix that does *not* seem to change anything.
>>>
>>> /Henrik
>>>
>>> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
>>> <henrik.bengtsson at gmail.com> wrote:
>>>>
>>>> BACKGROUND:
>>>>
>>>> While troubleshooting random, occasionally occurring, errors from
>>>> parallel::makePSOCKcluster("localhost", port = 11000);
>>>>
>>>> Error in socketConnection("localhost", port = port, server = TRUE,
>>>> blocking = TRUE, :
>>>>     cannot open the connection
>>>>
>>>> I had another look at parallel:::newPSOCKnode(), which is used
>>>> internally to set up each background worker.  It is designed to, first
>>>> launch the background worker as:
>>>>
>>>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>>>> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
>>>> wait = FALSE)
>>>>
>>>> which immediately tries to connect to a socket on localhost:11000 with
>>>> timeout.  Immediately after the master launched the above (without
>>>> waiting), it will set up the connection waiting for the connect from
>>>> the background worker:
>>>>
>>>>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>>>>         blocking = TRUE, open = "a+b", timeout = timeout)
>>>>
>>>>
>>>> ISSUE:
>>>>
>>>> If we emulate the above process, and remove the OUT=/dev/null such
>>>> that we can see the output produces by the worker, as:
>>>>
>>>> setup <- function(delay = 0) {
>>>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>>>> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>>>>   Sys.sleep(delay)
>>>>   socketConnection("localhost", port = 11000, server = TRUE, blocking
>>>> = TRUE, open = "a+b", timeout = 20)
>>>> }
>>>>
>>>> doing:
>>>>
>>>>> con <- setup(0)
>>>>
>>>> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>>>
>>>> will most likely work, but adding a delay:
>>>>
>>>>> con <- setup(5)
>>>>
>>>> starting worker pid=25099 on localhost:11000 at 18:45:23.617
>>>> Warning in socketConnection(master, port = port, blocking = TRUE, open
>>>> = "a+b",  :
>>>>   localhost:11000 cannot be opened
>>>> Error in socketConnection(master, port = port, blocking = TRUE, open =
>>>> "a+b",  :
>>>>   cannot open the connection
>>>> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
>>>> socketConnection
>>>>
>>>> will produce an *instant* error on the worker, and before master opens
>>>> the server socket.  Eventually, master will produce the originally
>>>> observed error:
>>>>
>>>> Error in socketConnection("localhost", port = 11000, server = TRUE,
>>>> blocking = TRUE,  :
>>>>   cannot open the connection
>>>>
>>>> In other words, if the master fails to setup socketConnection()
>>>> *before* the background workers attempts to connect, it all fails.
>>>> Such a delay may happen for instance when there is a large CPU load on
>>>> the test machine.
>>>>
>>>> Is this the above bug?
>>>>
>>>> /Henrik
>>>>
>>>> PS. The background is that I, very occasionally, observe R CMD check
>>>> error on the above (on CRAN and elsewhere) when testing my future
>>>> package. The error always go away when retested. This far I've though
>>>> this is due to port clashes (since the port is random selected in
>>>> [11000:11999]) and accepted that it happens.  However, after
>>>> discovering the above, it could be due to the worker launching "too
>>>> soon".
>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 3.4.3 (2017-11-30)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>> Running under: Ubuntu 16.04.4 LTS
>>>>
>>>> Matrix products: default
>>>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>>>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] compiler_3.4.3
>>
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu



More information about the R-devel mailing list