[R-pkg-devel] RFC: an interface to manage use of parallelism in packages

Ivan Krylov krylov.r00t at gmail.com
Wed Oct 25 14:55:29 CEST 2023


Summary: at the end of this message is a link to an R package
implementing an interface for managing the use of execution units in R
packages. As a package maintainer, would you agree to use something
like this? Does it look sufficiently reasonable to become a part of R?
Read on for why I made these particular interface choices.

My understanding of the problem stated by Simon Urbanek and Uwe Ligges
[1,2] is that we need a way to set and distribute the CPU core
allowance between multiple packages that could be using very different
methods to achieve parallel execution on the local machine, including
threads and child processes. We could have multiple well-meaning
packages calling each other, each using a different parallelism
technology: imagine parallel::makeCluster(getOption('mc.cores'))
combined with parallel::mclapply(mc.cores = getOption('mc.cores')) and
with an OpenMP program that also spawns getOption('mc.cores') threads.
A parallel BLAS or custom multi-threading using std::thread could add
more fuel to the fire.
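To make the failure mode concrete, here is a sketch (mine, not from
the posts cited above) of how two well-behaved layers multiply into 64
R processes on an 8-core machine before BLAS has spawned a single
thread:

    options(mc.cores = 8)
    ## 8 workers; each fork inherits mc.cores = 8 (not on Windows)
    cl <- parallel::makeForkCluster(getOption('mc.cores'))
    parallel::parLapply(cl, seq_len(8), function(i)
        ## ...and each worker honestly forks 8 more children: 64 total
        parallel::mclapply(seq_len(8), function(j) sum(rnorm(1e6)),
                           mc.cores = getOption('mc.cores')))
    parallel::stopCluster(cl)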

Workarounds applied by package maintainers nowadays are both
cumbersome (sometimes one has to talk to a package that lives
downstream in the call stack and isn't even an explicit dependency,
because it's the one responsible for the threads) and not really enough
(most maintainers forget to restore the state after they are done, so a
single example() may slow down the operations that follow).
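For example, a conscientious caller of data.table has to write
something like this inside its own function (a sketch; forgetting the
on.exit() is exactly the failure mode above):

    old <- data.table::getDTthreads()
    data.table::setDTthreads(1L)
    ## restore the caller's setting no matter how we exit
    on.exit(data.table::setDTthreads(old), add = TRUE)
    ## ... the actual work ...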

The problem is complicated by the fact that not every parallel
operation can explicitly accept the CPU core limit as a parameter. For
example, data.table's implicit parallelism is very convenient, and so
are parallel BLASes (which don't even have a standard interface for
changing the number of threads); we shouldn't prohibit implicit
parallelism just because it cannot accept an explicit limit.
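The current state of the art is telling: a third-party package such as
RhpcBLASctl has to guess which BLAS it is talking to, and OpenMP needs
a separate knob on top of that. A sketch, assuming RhpcBLASctl is
installed and recognises the BLAS in use:

    RhpcBLASctl::blas_set_num_threads(2)
    ## read by OpenMP runtimes that haven't initialised yet,
    ## and by child processes
    Sys.setenv(OMP_NUM_THREADS = '2')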

It's also not always obvious how to split the cores between the
potentially parallel sections. While it's typically best to start with
the outer loop (e.g. it's better to have 16 R processes solving
relatively small linear algebra problems back to back than one R process
spinning 15 of its 16 OpenBLAS threads in sched_yield()), it may be
more efficient to give all 16 threads back to BLAS (and save on
transferring the problems and solutions between processes) once the
problems become large enough to give enough work to all of the cores.

So as a user, I would like an interface that would both let me give all
of the cores to the program if that's what I need (something like
setCPUallowance(parallelly::availableCores())) _and_ let me be more
detailed when necessary (something like setCPUallowance(overall = 7,
packages = c(foobar = 1), BLAS = 2) to limit BLAS threads to 2,
disallow parallelism in the foobar package because it wastes too much
time, and limit R as a whole to 7 cores because I want to surf the 'net
on the remaining one while the Monte-Carlo simulation is going on). As
a package developer, I'd rather not think about any of that and just
use a function call like getCPUallowance() for the default number of
cores in every situation.
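Spelled out in code, the wish list might look like this; the function
names come from this proposal, nothing here is an existing API:

    ## as a user: hand over everything...
    setCPUallowance(parallelly::availableCores())
    ## ...or be specific about who gets what
    setCPUallowance(overall = 7, packages = c(foobar = 1), BLAS = 2)

    ## as a package developer: one call, no policy decisions
    ## (treating the result as a plain count for the moment)
    cl <- parallel::makeCluster(getCPUallowance())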

Can we implement such an interface? The main obstacle here is not
being able to know when each parallel region begins and ends. Does the
package call fork()? std::thread{}? Start a local mirai cluster? We
have to trust the package (and verify during R CMD check) to create
the given number of units of execution and to tell us when they are
done.

The closest interface that I see being implementable is a system of
tokens with reference semantics: getCPUallowance() returns a special
object containing the number of tokens the caller is allowed to use and
sets an environment variable with the remaining number of cores. Any R
child processes pick up the number of cores from the environment
variable. Any downstream calls to getCPUallowance(), aware of the
tokens already handed out, return a reduced number of remaining CPU
cores. Once the package is done executing a parallel section, it
returns the CPU allowance back to R by calling something like
close(token), which updates the internal allowance value (and the
environment variable). (A finalizer can also be set on the tokens to
ensure that CPU cores won't be lost.)
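A minimal sketch of that mechanism, with invented names (the
R_CPU_ALLOWANCE variable, the token class) standing in for whatever
the real interface would use:

    ## cores left to hand out, shared by every caller in this process
    .allowance <- new.env()
    .allowance$free <- as.integer(Sys.getenv('R_CPU_ALLOWANCE',
        as.character(parallel::detectCores())))

    getCPUallowance <- function(n = .allowance$free) {
        n <- min(n, .allowance$free)
        .allowance$free <- .allowance$free - n
        ## child processes pick up the reduced allowance from here
        Sys.setenv(R_CPU_ALLOWANCE = .allowance$free)
        token <- new.env()
        token$cores <- n
        class(token) <- 'CPUallowance'
        ## return the cores even if the caller forgets to close()
        reg.finalizer(token, function(t) if (!is.null(t$cores)) close(t))
        token
    }

    close.CPUallowance <- function(con, ...) {
        .allowance$free <- .allowance$free + con$cores
        Sys.setenv(R_CPU_ALLOWANCE = .allowance$free)
        con$cores <- NULL  # mark as returned; finalizer becomes a no-op
        invisible(NULL)
    }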

Here's a package implementing this idea:
<https://codeberg.org/aitap/R-CPUallowance>. Currently missing are
terrible hacks to determine the BLAS type at runtime and resolve the
necessary symbols to set the number of BLAS threads, depending on
whether it's OpenBLAS, flexiblas, MKL, or something else. Does it feel
over-engineered? I hope that, even if this isn't a good solution, it
will help us move towards a unified one that could just work™ on
everything from laptops to CRAN testing machines to HPCs.
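For what it's worth, detecting the BLAS is the easier half (assuming
R >= 3.5.0, where extSoftVersion() reports the BLAS binary in use);
setting its thread count is where the hacks begin:

    blas <- extSoftVersion()['BLAS']  # e.g. '.../libopenblas.so.0'
    if (grepl('openblas', blas, ignore.case = TRUE)) {
        ## openblas_set_num_threads(int) takes its argument by value,
        ## so it cannot be reached via .C(); it must be resolved and
        ## called from compiled code, one symbol name per BLAS.
    }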

-- 
Best regards,
Ivan

[1] https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009484.html

[2] https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009513.html
