[R-pkg-devel] [External] Guidelines on use of snow-style clusters in R packages?

luke-tierney at uiowa.edu
Wed Jun 3 15:54:56 CEST 2020

The basic principle I would follow is to make sure your code only goes
parallel with explicit permission from the end user. One way to do
that is to accept a cluster from the caller; another is to create
and shut down your own cluster if a global option is set (via options()
or a mechanism of your own).
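A minimal sketch of that pattern (the function name par_apply and the option name mypkg.cores are hypothetical, not from any real package):

```r
# Sketch: go parallel only with explicit permission from the caller.
# The caller can pass a cluster, or set options(mypkg.cores = n) to
# let the function create -- and always stop -- its own.
par_apply <- function(x, f, cl = NULL) {
  if (is.null(cl)) {
    cores <- getOption("mypkg.cores", default = 0L)
    if (cores < 2L) return(lapply(x, f))  # default: stay serial
    cl <- parallel::makeCluster(cores)
    on.exit(parallel::stopCluster(cl), add = TRUE)  # always clean up
  }
  parallel::parLapply(cl, x, f)
}
```

With no option set and no cluster supplied, the function silently falls back to lapply(), so the package never consumes extra cores without being asked.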

If you create and shut down your own cluster you can do pretty much
what you like. If you use one passed to you, it is best to leave
it in the state you found it, at least as far as the search path and
global environment are concerned. So use foo::bar instead of library(foo).
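For example, a namespace-qualified call leaves the workers' search paths untouched (stats::median here is just a stand-in for any package function you would ship work to):

```r
cl <- parallel::makeCluster(2)
# Avoid: parallel::clusterEvalQ(cl, library(stats)) -- that alters
# each worker's search path. Qualifying the call with :: instead
# attaches nothing on the workers.
res <- parallel::parSapply(cl, list(1:5, 2:10),
                           function(x) stats::median(x))
parallel::stopCluster(cl)
```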

Users can also set a default cluster. You can use getDefaultCluster to
retrieve it; this returns NULL if no default cluster is set. You
could assume that if one is set you are allowed to use it, but it
might still be a good idea to look for explicit permission via an
option or an argument. I would again try to leave a cluster used
this way in as clean a state as you can.
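A short sketch of that hand-off, with the user-side and package-side steps marked (the serial fallback is one reasonable choice, not the only one):

```r
library(parallel)

# User side: register a default cluster for the session.
cl <- makeCluster(2)
setDefaultCluster(cl)

# Package side: use the default cluster if one exists,
# otherwise fall back to serial evaluation.
dcl <- getDefaultCluster()
res <- if (is.null(dcl)) lapply(1:4, sqrt) else parLapply(dcl, 1:4, sqrt)

# User side again: unregister and shut down.
setDefaultCluster(NULL)
stopCluster(cl)
```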



On Sun, 24 May 2020, Ivan Krylov wrote:

> Some of the packages I use make it possible to run some of the
> computations in parallel. For example, sNPLS::cv_snpls calls
> makeCluster() itself, makes sure that the package is loaded by workers,
> exports the necessary variables and stops the cluster after it is
> finished. On the other hand, multiway::parafac accepts arbitrary
> cluster objects supplied by the user, but requires the user to manually
> preload the package on the workers. Both packages export and document
> the internal functions intended to run on the workers.
>
> Are there any guidelines for use of snow-style clusters in R packages? I
> remember reading somewhere that accepting arbitrary cluster objects from
> the user instead of makeCluster(detectCores()) is generally considered
> a good idea (for multiple reasons ranging from giving the user more
> control of CPU load to making it possible to run the code on a number
> of networked machines that the package code knows nothing about), but I
> couldn't find a reference for that in Writing R Extensions or parallel
> package documentation.
>
> What about preloading the package on the workers? Are there any
> downsides to the package code unconditionally running clusterEvalQ(cl,
> library(myself)) to avoid disappointing errors like "10 nodes produced
> errors; first error: could not find function"?
>
> Speaking of private functions intended to run by the package itself on
> the worker nodes, should they be exported? I have prepared a test
> package doing little more than the following:
> R/fun.R:
>   private <- function(x) paste(x, Sys.getpid())
>   public <- function(cl, x) parallel::parLapply(cl, x, private)
> NAMESPACE:
>   export(public)
> The package passes R CMD check --as-cran without warnings or errors,
> which seems to suggest that exporting worker functions is not required.

Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
