[R-sig-hpc] How to attain randomness and reproducibility

Henrik Bengtsson henr|k@bengt@@on @end|ng |rom gm@||@com
Fri Dec 8 20:22:12 CET 2023


Hello,

these are all good questions on a very important topic, especially if
we do statistics that rely on randomness.

First, %do% and %dopar% are completely different creatures. That is
important to take into account here.  It is not possible to generate
numerically reproducible random number between the two - at least not
with built-in methods.  Any attempts to do so would require major
efforts.

For %do%, which is running all sequentially in the main R session, you
can use traditional techniques in R, i.e. set.seed().  This is not
different from how we do it with sequential lapply(), purrr::map(),
etc.

If you parallelize with %dopar%, you want to replace that with %dorng%
from the doRNG package.  Then %dorng% will take care of producing
statistically sound (L'Ecuyer-CMRG) random numbers for your parallel
threads.  The reason why this works is that %dorng% is a wrapper
around %dopar%.  It works with any registerDoNNN() adapter than the
user chooses.

y <- foreach(i = 1:3) %dorng% {
  rnorm(1)
}

You do *not* want to use the alternative that doRNG provides,
registerDoRNG() with %dopar%.  The reason why you do not want to use
that, is that as a developer you loose control.  registerDoRNG() can
be overridden by the end-user, where with %dorng% you are guaranteed
that your foreach() call will use parallel RNG from doRNG.

We should never call set.seed() inside code, functions, or map-reduce
calls, loops, inside package code, etc.  It is not designed for that,
it is not statistically sound, and may introduce biases.  Instead, we
should also aim for setting it at the very beginning of our R scripts,
and leave it to the end-user to be able to adjust it, or remove it.
If we set in an R package, there is a great risk it has a negative,
unforeseeable impact elsewhere.

If you use the Futureverse for parallelization (I'm the maintainer),
you have a third option for foreach() by using the doFuture package
(https://dofuture.futureverse.org/).  Replace %dopar% with %dofuture%.
Then pass argument .options.future = list(seed = TRUE) to foreach(),
to declare you want to use parallel RNGs (L'Ecuyer-CMRG), e.g.

y <- foreach(i = 1:3, .options.future = list(seed = TRUE)) %dofuture% {
  rnorm(1)
}

An alternative for the exact same thing, is:

y <- foreach(i = 1:3) %dofuture% {
  rnorm(1)
} %seed% TRUE

This is fully compatible with how it works when using
future.apply::future_lapply(), furrr::future_map(), etc.

/Henrik

PS. I'll hijack this thread to push out a reminder: foreach() is *not*
a for loop.  It's no difference from using lapply(), purrr::map(),
etc. - it's just a different syntax. To better understand why this is,
we should all challenge ourselves to not use terms like "for-loop" and
"loop" when talking about foreach().

On Fri, Dec 8, 2023 at 10:49 AM David Lundquist
<davidpatricklundquist using gmail.com> wrote:
>
> Currently, I am using the foreach package in R to do the following (in
> R-flavored pseudocode):
>
> *set.seed(1)*
> *foreach square in big grid of parameter combinations **%dopar%*
> *     foreach i in 1:replication_count **%do%*
> *            do some computation, output a row, and rbind it*
> *save the df generated by rbinding*
>
> Possible virtues of the preceding: seed is made clear
> Possible vices of the preceding: failure of randomness
>
> One might therefore try moving set.seed to the inner loop:
>
> *foreach square in big grid of parameter combinations %dopar%*
> *     foreach i in 1:replication_count **%do%*
>             *set.seed(i)*
> *            do some computation, output a row, and rbind it*
> *save the df generated by rbinding*
>
> Possible virtues of the preceding: seed sequence is explicit
> Possible vices of the preceding: we've programmed the seeds to be equal
>
> I'd appreciate any advice, especially if there's an easy reference/vignette
> that can be pointed to, a la
>
> https://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
> https://cran.r-project.org/web/packages/dqrng/vignettes/parallel.html
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc using r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc



More information about the R-sig-hpc mailing list