[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Henrik Bengtsson henrik.bengtsson at gmail.com
Fri Mar 9 04:05:47 CET 2018


BACKGROUND:

While troubleshooting intermittent errors from
parallel::makePSOCKcluster("localhost", port = 11000), such as:

Error in socketConnection("localhost", port = port, server = TRUE,
blocking = TRUE, :
    cannot open the connection

I had another look at parallel:::newPSOCKnode(), which is used
internally to set up each background worker.  It is designed to first
launch the background worker as:

  system(paste('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args',
               'MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000',
               'XDR=TRUE'), wait = FALSE)

which immediately tries to connect to a socket on localhost:11000 with
a timeout.  Right after launching the above (without waiting for it),
the master sets up the server socket and waits for the background
worker to connect:

    con <- socketConnection("localhost", port = 11000, server = TRUE,
        blocking = TRUE, open = "a+b", timeout = timeout)


ISSUE:

If we emulate the above process, but drop OUT=/dev/null so
that we can see the output produced by the worker, as:

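setup <- function(delay = 0) {
  # Launch the worker without waiting; OUT= is omitted so its output is visible
  system(paste('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args',
               'MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE'),
         wait = FALSE)
  # Optionally delay the master before it opens the server socket
  Sys.sleep(delay)
  socketConnection("localhost", port = 11000, server = TRUE,
                   blocking = TRUE, open = "a+b", timeout = 20)
}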

doing:

> con <- setup(0)
starting worker pid=24983 on localhost:11000 at 18:44:30.087

will most likely work, but adding a delay:

> con <- setup(5)
starting worker pid=25099 on localhost:11000 at 18:45:23.617
Warning in socketConnection(master, port = port, blocking = TRUE, open
= "a+b",  :
  localhost:11000 cannot be opened
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
  cannot open the connection
Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
socketConnection

will produce an *instant* error on the worker, before the master opens
the server socket.  Eventually, the master will produce the originally
observed error:

Error in socketConnection("localhost", port = 11000, server = TRUE,
blocking = TRUE,  :
  cannot open the connection

In other words, if the master fails to set up socketConnection()
*before* the background worker attempts to connect, it all fails.
Such a delay may happen, for instance, when there is a high CPU load on
the test machine.
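
For illustration, a worker that tolerates a late master could retry the
connection instead of giving up on the first failure.  Below is a
minimal sketch of that idea.  retry_connect() and its arguments are
hypothetical names, not part of the parallel package, and this is not
what .slaveRSOCK() does in the R 3.4.3 session shown here:

```r
# Hypothetical helper (NOT part of the parallel package): call `connect()`
# until it succeeds or `retries` attempts have been made, sleeping with
# exponential backoff (wait, 2*wait, 4*wait, ...) between attempts.
retry_connect <- function(connect, retries = 5L, wait = 0.1) {
  stopifnot(is.function(connect), retries >= 1L, wait >= 0)
  for (attempt in seq_len(retries)) {
    # Capture the error object instead of letting it abort the loop
    res <- tryCatch(suppressWarnings(connect()), error = function(e) e)
    if (!inherits(res, "error")) return(res)
    if (attempt < retries) Sys.sleep(wait * 2^(attempt - 1L))
  }
  stop("failed to connect after ", retries, " attempts: ",
       conditionMessage(res))
}

# Possible use on the worker side (assumes a master that will eventually
# listen on localhost:11000; not run here):
# con <- retry_connect(function() {
#   socketConnection("localhost", port = 11000, blocking = TRUE,
#                    open = "a+b", timeout = 20)
# })
```

The backoff keeps the cost low when the master is ready quickly, while
the total waiting time can be raised through `retries` and `wait` on a
heavily loaded machine.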

Is the above a bug?

/Henrik

PS. The background is that I, very occasionally, observe R CMD check
errors from the above (on CRAN and elsewhere) when testing my future
package. The error always goes away when retested. So far I've thought
this was due to port clashes (since the port is randomly selected in
[11000:11999]) and accepted that it happens.  However, after
discovering the above, it could be due to the worker launching "too
soon".

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3


