[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Henrik Bengtsson henrik.bengtsson at gmail.com
Fri Mar 9 04:36:54 CET 2018


I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
socketConnection() as a character, i.e. there's a missing timeout <-
as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
as.logical(value):

> parallel:::.slaveRSOCK
function ()
{
    makeSOCKmaster <- function(master, port, timeout, useXDR) {
        port <- as.integer(port)
        con <- socketConnection(master, port = port, blocking = TRUE,
            open = "a+b", timeout = timeout)
        structure(list(con = con), class = if (useXDR)
            "SOCKnode"
        else "SOCK0node")
    }
    master <- "localhost"
    port <- NA_integer_
    outfile <- Sys.getenv("R_SNOW_OUTFILE")
    methods <- TRUE
    useXDR <- TRUE
    for (a in commandArgs(TRUE)) {
        pos <- regexpr("=", a)
        name <- substr(a, 1L, pos - 1L)
        value <- substr(a, pos + 1L, nchar(a))
        switch(name, MASTER = {
            master <- value
        }, PORT = {
            port <- value
        }, OUT = {
            outfile <- value
        }, TIMEOUT = {
            timeout <- value
        }, XDR = {
            useXDR <- as.logical(value)
        })
    }
    if (is.na(port))
        stop("PORT must be specified")
    sinkWorkerOutput(outfile)
    msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
        paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
    cat(msg)
    slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
}
<bytecode: 0x4bd4b58>
<environment: namespace:parallel>

Yet, fix that does *not* seem to change anything.

/Henrik

On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
<henrik.bengtsson at gmail.com> wrote:
> BACKGROUND:
>
> While troubleshooting random, occasionally occurring, errors from
> parallel::makePSOCKcluster("localhost", port = 11000);
>
> Error in socketConnection("localhost", port = port, server = TRUE,
> blocking = TRUE, :
>     cannot open the connection
>
> I had another look at parallel:::newPSOCKnode(), which is used
> internally to set up each background worker.  It is designed to, first
> launch the background worker as:
>
>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
> wait = FALSE)
>
> which immediately tries to connect to a socket on localhost:11000 with
> timeout.  Immediately after the master launched the above (without
> waiting), it will set up the connection waiting for the connect from
> the background worker:
>
>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>         blocking = TRUE, open = "a+b", timeout = timeout)
>
>
> ISSUE:
>
> If we emulate the above process, and remove the OUT=/dev/null such
> that we can see the output produces by the worker, as:
>
> setup <- function(delay = 0) {
>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>   Sys.sleep(delay)
>   socketConnection("localhost", port = 11000, server = TRUE, blocking
> = TRUE, open = "a+b", timeout = 20)
> }
>
> doing:
>
>> con <- setup(0)
> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>
> will most likely work, but adding a delay:
>
>> con <- setup(5)
> starting worker pid=25099 on localhost:11000 at 18:45:23.617
> Warning in socketConnection(master, port = port, blocking = TRUE, open
> = "a+b",  :
>   localhost:11000 cannot be opened
> Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
>   cannot open the connection
> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
> socketConnection
>
> will produce an *instant* error on the worker, and before master opens
> the server socket.  Eventually, master will produce the originally
> observed error:
>
> Error in socketConnection("localhost", port = 11000, server = TRUE,
> blocking = TRUE,  :
>   cannot open the connection
>
> In other words, if the master fails to setup socketConnection()
> *before* the background workers attempts to connect, it all fails.
> Such a delay may happen for instance when there is a large CPU load on
> the test machine.
>
> Is this the above bug?
>
> /Henrik
>
> PS. The background is that I, very occasionally, observe R CMD check
> error on the above (on CRAN and elsewhere) when testing my future
> package. The error always go away when retested. This far I've though
> this is due to port clashes (since the port is random selected in
> [11000:11999]) and accepted that it happens.  However, after
> discovering the above, it could be due to the worker launching "too
> soon".
>
>> sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.4 LTS
>
> Matrix products: default
> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.3



More information about the R-devel mailing list