[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)
Henrik Bengtsson
henrik.bengtsson at gmail.com
Sat Mar 10 02:52:16 CET 2018
Great.
For the record, I've submitted a patch as PR17391
(https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17391). The patch
is against the latest R-devel in SVN, passes 'make check-all', and I've
verified that it works with the tests quoted in this thread.
/Henrik
On Fri, Mar 9, 2018 at 4:37 AM, <luke-tierney at uiowa.edu> wrote:
> I'm happy to look at a patch that does this. I'd start with a small
> interval and increase it by 50%, say, on each try with a max retry time
> limit. This isn't eliminating the problem, only reducing the
> probability, but still worth it. I had considered doing something like
> this but it didn't seem necessary at the time. You don't want to retry
> indefinitely since the connection could be failing because the master
> died, and then you want the workers to die as well.
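
A minimal sketch of the backoff scheme described above, assuming a hypothetical helper retry_with_backoff() (it is not part of 'parallel', and it is not necessarily what the submitted patch does). It starts with a small interval, grows it by 50% after each failed attempt, and gives up once a total waiting budget is spent. The 'flaky' connector only stands in for socketConnection() so that the example runs without a network:

```r
# Hypothetical helper (not part of 'parallel'): call 'connect' until it
# succeeds or until waiting any longer would exceed 'max_time' seconds.
retry_with_backoff <- function(connect, interval = 0.1, growth = 1.5,
                               max_time = 60) {
  waited <- 0
  repeat {
    res <- tryCatch(connect(), error = identity)
    if (!inherits(res, "error")) return(res)
    # Give up once the retry budget is spent, so that a worker whose
    # master has died does not keep retrying forever.
    if (waited + interval > max_time) stop(res)
    Sys.sleep(interval)
    waited <- waited + interval
    interval <- interval * growth  # wait 50% longer before the next try
  }
}

# Stand-in connector for illustration: fails twice, then succeeds.
attempts <- 0L
flaky <- function() {
  attempts <<- attempts + 1L
  if (attempts < 3L) stop("cannot open the connection")
  "connected"
}
result <- retry_with_backoff(flaky, interval = 0.01)
```

In a worker, 'connect' would instead be a function wrapping the socketConnection(master, port = port, ...) call. Limiting the total waiting time rather than the number of attempts keeps the worst-case delay predictable as the interval grows.
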
>
> Best,
>
> luke
>
>
> On Fri, 9 Mar 2018, Henrik Bengtsson wrote:
>
>> A solution is to have parallel:::.slaveRSOCK() attempt to connect
>> multiple times before failing, e.g.
>>
>> makeSOCKmaster <- function(master, port, timeout, useXDR,
>>                            maxTries = 10L, interval = 1.0) {
>>   port <- as.integer(port)
>>   for (i in seq_len(maxTries)) {
>>     con <- tryCatch({
>>       socketConnection(master, port = port, blocking = TRUE,
>>                        open = "a+b", timeout = timeout)
>>     }, error = identity)
>>     if (inherits(con, "connection")) break
>>     Sys.sleep(interval)
>>   }
>>   if (inherits(con, "error")) stop(con)
>>   structure(list(con = con),
>>             class = if (useXDR) "SOCKnode" else "SOCK0node")
>> }
>>
>> One could set 'maxTries' and 'interval' via commandArgs(), as is
>> done for the other arguments.
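
As a rough sketch of that idea: the KEY=VALUE parsing could be extended along the following lines. The helper parse_worker_args() and the keys MAXTRIES and INTERVAL are hypothetical names chosen for illustration; 'parallel' does not define them. Values are converted as they are parsed, so the numeric settings do not stay character strings:

```r
# Hypothetical sketch: parse KEY=VALUE worker arguments, including two
# made-up keys (MAXTRIES, INTERVAL) that 'parallel' itself does not define.
parse_worker_args <- function(args) {
  opts <- list(master = "localhost", port = NA_integer_,
               timeout = NA_integer_, useXDR = TRUE,
               maxTries = 10L, interval = 1.0)
  for (a in args) {
    pos <- regexpr("=", a, fixed = TRUE)
    if (pos < 1L) next  # skip arguments that are not KEY=VALUE
    name <- substr(a, 1L, pos - 1L)
    value <- substr(a, pos + 1L, nchar(a))
    switch(name,
      MASTER   = { opts$master   <- value },
      PORT     = { opts$port     <- as.integer(value) },
      TIMEOUT  = { opts$timeout  <- as.integer(value) },
      XDR      = { opts$useXDR   <- as.logical(value) },
      MAXTRIES = { opts$maxTries <- as.integer(value) },
      INTERVAL = { opts$interval <- as.numeric(value) })
  }
  opts
}

# In a worker, 'args' would come from commandArgs(TRUE).
opts <- parse_worker_args(c("MASTER=localhost", "PORT=11000",
                            "TIMEOUT=2592000", "MAXTRIES=5",
                            "INTERVAL=0.5"))
```

Unknown keys fall through the switch() and are ignored, so the defaults for 'maxTries' and 'interval' apply whenever the master does not pass them.
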
>>
>> I'm happy to submit an SVN patch if R core thinks this is an
>> acceptable solution.
>>
>> /Henrik
>>
>> On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
>> <henrik.bengtsson at gmail.com> wrote:
>>>
>>> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
>>> socketConnection() as a character, i.e. there's a missing timeout <-
>>> as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
>>> as.logical(value):
>>>
>>>> parallel:::.slaveRSOCK
>>>
>>> function ()
>>> {
>>>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>>>         port <- as.integer(port)
>>>         con <- socketConnection(master, port = port, blocking = TRUE,
>>>             open = "a+b", timeout = timeout)
>>>         structure(list(con = con), class = if (useXDR)
>>>             "SOCKnode"
>>>         else "SOCK0node")
>>>     }
>>>     master <- "localhost"
>>>     port <- NA_integer_
>>>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>>>     methods <- TRUE
>>>     useXDR <- TRUE
>>>     for (a in commandArgs(TRUE)) {
>>>         pos <- regexpr("=", a)
>>>         name <- substr(a, 1L, pos - 1L)
>>>         value <- substr(a, pos + 1L, nchar(a))
>>>         switch(name, MASTER = {
>>>             master <- value
>>>         }, PORT = {
>>>             port <- value
>>>         }, OUT = {
>>>             outfile <- value
>>>         }, TIMEOUT = {
>>>             timeout <- value
>>>         }, XDR = {
>>>             useXDR <- as.logical(value)
>>>         })
>>>     }
>>>     if (is.na(port))
>>>         stop("PORT must be specified")
>>>     sinkWorkerOutput(outfile)
>>>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>>>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>>>     cat(msg)
>>>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
>>> }
>>> <bytecode: 0x4bd4b58>
>>> <environment: namespace:parallel>
>>>
>>> Yet, fixing that does *not* seem to change anything.
>>>
>>> /Henrik
>>>
>>> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
>>> <henrik.bengtsson at gmail.com> wrote:
>>>>
>>>> BACKGROUND:
>>>>
>>>> While troubleshooting random, occasional errors from
>>>> parallel::makePSOCKcluster("localhost", port = 11000):
>>>>
>>>> Error in socketConnection("localhost", port = port, server = TRUE,
>>>> blocking = TRUE, :
>>>> cannot open the connection
>>>>
>>>> I had another look at parallel:::newPSOCKnode(), which is used
>>>> internally to set up each background worker. It is designed to first
>>>> launch the background worker as:
>>>>
>>>> system(paste('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args',
>>>>   'MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE'),
>>>>   wait = FALSE)
>>>>
>>>> which immediately tries to connect to a socket on localhost:11000 with
>>>> the given timeout. Immediately after the master has launched the above
>>>> (without waiting), it sets up the server socket and waits for the
>>>> background worker to connect:
>>>>
>>>> con <- socketConnection("localhost", port = 11000, server = TRUE,
>>>> blocking = TRUE, open = "a+b", timeout = timeout)
>>>>
>>>>
>>>> ISSUE:
>>>>
>>>> If we emulate the above process, and remove the OUT=/dev/null such
>>>> that we can see the output produced by the worker, as:
>>>>
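>>>> setup <- function(delay = 0) {
>>>>   # Launch the worker in the background (the command is split with
>>>>   # paste() so that the string contains no line break)
>>>>   system(paste('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args',
>>>>                'MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE'),
>>>>          wait = FALSE)
>>>>   # Emulate a master that is slow to open its server socket
>>>>   Sys.sleep(delay)
>>>>   socketConnection("localhost", port = 11000, server = TRUE,
>>>>                    blocking = TRUE, open = "a+b", timeout = 20)
>>>> }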
>>>>
>>>> doing:
>>>>
>>>>> con <- setup(0)
>>>>
>>>> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>>>
>>>> will most likely work, but adding a delay:
>>>>
>>>>> con <- setup(5)
>>>>
>>>> starting worker pid=25099 on localhost:11000 at 18:45:23.617
>>>> Warning in socketConnection(master, port = port, blocking = TRUE, open
>>>> = "a+b", :
>>>> localhost:11000 cannot be opened
>>>> Error in socketConnection(master, port = port, blocking = TRUE, open =
>>>> "a+b", :
>>>> cannot open the connection
>>>> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
>>>> socketConnection
>>>>
>>>> will produce an *instant* error on the worker, before the master opens
>>>> the server socket. Eventually, the master will produce the originally
>>>> observed error:
>>>>
>>>> Error in socketConnection("localhost", port = 11000, server = TRUE,
>>>> blocking = TRUE, :
>>>> cannot open the connection
>>>>
>>>> In other words, if the master fails to set up socketConnection()
>>>> *before* the background worker attempts to connect, it all fails.
>>>> Such a delay may happen, for instance, when there is a large CPU load
>>>> on the test machine.
>>>>
>>>> Is the above a bug?
>>>>
>>>> /Henrik
>>>>
>>>> PS. The background is that I, very occasionally, observe R CMD check
>>>> errors like the above (on CRAN and elsewhere) when testing my future
>>>> package. The error always goes away when retested. So far I've thought
>>>> this was due to port clashes (since the port is randomly selected in
>>>> [11000:11999]) and accepted that it happens. However, after
>>>> discovering the above, it could be due to the worker launching "too
>>>> soon".
>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 3.4.3 (2017-11-30)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>> Running under: Ubuntu 16.04.4 LTS
>>>>
>>>> Matrix products: default
>>>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>>>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] compiler_3.4.3
>>
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa Phone: 319-335-3386
> Department of Statistics and Fax: 319-335-3017
> Actuarial Science
> 241 Schaeffer Hall email: luke-tierney at uiowa.edu
> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu