[R-pkg-devel] Guidelines on use of snow-style clusters in R packages?

Ivan Krylov krylov.r00t at gmail.com
Sun May 24 16:21:47 CEST 2020


Some of the packages I use make it possible to run some of their
computations in parallel. For example, sNPLS::cv_snpls calls
makeCluster() itself, makes sure the package is loaded on the workers,
exports the necessary variables, and stops the cluster when it is
finished. On the other hand, multiway::parafac accepts an arbitrary
cluster object supplied by the user, but requires the user to preload
the package on the workers manually. Both packages export and document
the internal functions intended to run on the workers.
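
A middle ground between the two designs would be to accept an optional
cluster argument and create a temporary local one when none is
supplied. A rough sketch (all names illustrative, not taken from either
package):

fit_model <- function(x, cl = NULL) {
  # create a throwaway local cluster when the caller did not supply one
  if (is.null(cl)) {
    cl <- parallel::makeCluster(2L)
    on.exit(parallel::stopCluster(cl), add = TRUE)
  }
  # the real work; squaring stands in for the actual computation
  parallel::parLapply(cl, x, function(xi) xi^2)
}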

Are there any guidelines for the use of snow-style clusters in R
packages? I remember reading somewhere that accepting arbitrary cluster
objects from the user, instead of calling makeCluster(detectCores())
unconditionally, is generally considered good practice (for multiple
reasons, ranging from giving the user more control over CPU load to
making it possible to run the code on networked machines that the
package code knows nothing about), but I couldn't find a reference for
this in Writing R Extensions or the parallel package documentation.
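
For instance, a user with several machines available could build a
PSOCK cluster spanning hosts that the package cannot possibly discover
on its own, and hand it to the (hypothetical) entry point sketched
above:

# host names are made up for illustration
cl <- parallel::makePSOCKcluster(c("localhost", "node1", "node2"))
res <- fit_model(x, cl)
parallel::stopCluster(cl)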

What about preloading the package on the workers? Are there any
downsides to the package code unconditionally running clusterEvalQ(cl,
library(myself)) to avoid disappointing errors like "10 nodes produced
errors; first error: could not find function"?
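
For concreteness, the pattern I have in mind looks like this, with the
package name and worker_fun() being placeholders:

run_parallel <- function(cl, x) {
  # make sure this package's namespace is loaded on every worker first;
  # library() will simply error on a worker that lacks the package
  parallel::clusterEvalQ(cl, library(mypackage))
  parallel::parLapply(cl, x, worker_fun)
}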

Speaking of private functions intended to be run by the package itself
on the worker nodes: should they be exported? I have prepared a test
package doing little more than the following:

R/fun.R:
# not exported; intended to be executed on the workers
private <- function(x) paste(x, Sys.getpid())
# exported entry point; runs private() on each element of x in parallel
public <- function(cl, x) parallel::parLapply(cl, x, private)

NAMESPACE:
export(public)

The package passes R CMD check --as-cran without warnings or errors,
which seems to suggest that exporting worker functions is not required.
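
For what it's worth, a quick interactive test (assuming the package is
installed on the machine where the workers run) also behaves as
expected:

cl <- parallel::makeCluster(2L)
public(cl, letters[1:3])  # each letter comes back pasted to a worker PID
parallel::stopCluster(cl)

My understanding is that this works because parLapply() serializes
private() together with a reference to the package namespace, which the
worker resolves by loading the installed package, so exporting the
function is unnecessary.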

-- 
Best regards,
Ivan


