[R-pkg-devel] RFC: An ad-hoc "cluster" one can leave and rejoin later

Ivan Krylov kry|ov@r00t @end|ng |rom gm@||@com
Fri Apr 28 09:48:38 CEST 2023


On Thu, 27 Apr 2023 11:47:27 +0000
"Viechtbauer, Wolfgang (NP)"
<wolfgang.viechtbauer using maastrichtuniversity.nl> wrote:

> Can you clarify what happens if a node disconnects from the pool
> while it is running some assigned task? I assume/hope the pool server
> keeps track of that and will then submit the nonfinished task to
> another node.

This is exactly what happens. In the function that removes a node from
the pool, there is a check for a pending task associated with that
node. If such a task is found, it's put back at the end of the queue.
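
A minimal sketch of this re-queueing logic, not the package's actual
code. All names here (new_pool, assign_task, remove_node, the queue and
pending fields) are hypothetical and only illustrate the idea:

```r
# Hypothetical pool state: a FIFO queue of tasks plus a map from node id
# to the task that node is currently running.
new_pool <- function(tasks) {
  pool <- new.env()
  pool$queue <- as.list(tasks)
  pool$pending <- list()
  pool
}

# Hand the task at the head of the queue to a node.
assign_task <- function(pool, node) {
  if (length(pool$queue) == 0) return(NULL)
  task <- pool$queue[[1]]
  pool$queue <- pool$queue[-1]
  pool$pending[[node]] <- task
  task
}

# Remove a node; if it still had a pending task, put that task back at
# the end of the queue so another node can pick it up.
remove_node <- function(pool, node) {
  task <- pool$pending[[node]]
  if (!is.null(task)) {
    pool$queue <- c(pool$queue, list(task))
    pool$pending[[node]] <- NULL
  }
  invisible(pool)
}

pool <- new_pool(c("t1", "t2", "t3"))
assign_task(pool, "nodeA")   # nodeA takes t1
assign_task(pool, "nodeB")   # nodeB takes t2
remove_node(pool, "nodeA")   # nodeA disappears while running t1
unlist(pool$queue)           # t3 first, then the re-queued t1
```

Appending rather than prepending means the interrupted task waits
behind everything already queued.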

(A consequence: a task that crashes a node in a way that tryCatch()
cannot catch is re-queued after every crash, so it will eventually take
the whole pool offline, one node at a time. On the other hand, if the
nodes are restarted automatically, they will run all other tasks in the
queue before encountering the crashing task again.)

I'd like to write an integration test that would create a pool with two
nodes, send two tasks consisting of Sys.sleep() to them, then crash one
of the nodes after they accept the tasks. Even without the crashing
(which could be part of the task, if (node_destined_to_crash) q('no')),
this is some hair-raising code: I need multiple child processes running
with the same temporary library where the package version being tested
is installed, and also some synchronisation between them.
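
A rough sketch of the process-management part of such a test, using
only base R. It does not touch the pool itself. The node script, the
flag files and the helpers start_node() and wait_for() are assumptions
made for illustration, and the step that installs the package into the
temporary library is only indicated in a comment:

```r
# Temporary library shared by all child processes. In a real test, the
# package under test would first be installed here, for example with
# install.packages(path_to_tarball, lib = lib, repos = NULL).
lib <- tempfile("lib")
dir.create(lib)
Sys.setenv(R_LIBS = lib)   # child processes inherit this variable

# Hypothetical node script: signal "task accepted" through a flag file,
# then either crash or sleep and signal "task finished".
script <- tempfile(fileext = ".R")
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)",
  "file.create(args[1])",
  "if (args[2] == 'crash') q('no', status = 1)",
  "Sys.sleep(1)",
  "file.create(paste0(args[1], '.done'))"
), script)

rscript <- file.path(R.home("bin"), "Rscript")

# Launch one child R process in the background.
start_node <- function(flag, mode) {
  system2(rscript, c(shQuote(script), shQuote(flag), mode),
          wait = FALSE, stdout = FALSE, stderr = FALSE)
}

# Poll until all files exist or the timeout (in seconds) expires.
wait_for <- function(files, timeout = 30) {
  deadline <- Sys.time() + timeout
  while (!all(file.exists(files)) && Sys.time() < deadline) Sys.sleep(0.1)
  all(file.exists(files))
}

flags <- c(tempfile("node1"), tempfile("node2"))
start_node(flags[1], "work")
start_node(flags[2], "crash")

accepted <- wait_for(flags)                    # both nodes got going
finished <- wait_for(paste0(flags[1], ".done")) # the survivor completed
crashed  <- !file.exists(paste0(flags[2], ".done"))
```

Flag files are the crudest form of synchronisation available; they are
used here only because they need nothing beyond base R.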

> Also, are there any issues with using the pool machine also as a node?

There shouldn't be. In fact, I should probably add a parameter to the
run_pool() function that automatically creates a number of nodes on the
same machine.

> PS: In the README, 'cliends' -> 'clients'.

Thanks!

-- 
Best regards,
Ivan


