[Bioc-devel] how to achieve reproducibility with BiocParallel regardless of number of threads and OS (set.seed is disallowed)

Lulu Chen luluchen @ending from vt@edu
Thu Jan 3 17:38:52 CET 2019


Thanks for teaching me how to set a seed for each job!

On Wed, Jan 2, 2019 at 9:45 AM Martin Morgan <mtmorgan.bioc using gmail.com>
wrote:

> I'll back-track on my advice a little, and say that the right way to
> enable the user to get reproducible results is to respect the setting the
> user makes outside your function. So for
>
> your = function()
>      unlist(bplapply(1:4, rnorm))
>
> The user will
>
> register(MulticoreParam(2, RNGseed=123))
> your()
>
> to always produces the identical result.
>
> Following Aaron's strategy, the R-level approach to reproducibility might
> be along the lines of
>
> - tell the user to set parallel::RNGkind("L'Ecuyer-CMRG") and set.seed()
> - In your function, generate seeds for each job
>
>     n = 5; seeds <- vector("list", n)
>     seeds[[1]] = .Random.seed  # FIXME fails if set.seed or random nos.
> have not been generated...
>     for (i in tail(seq_len(n), -1)) seeds[[i]] = nextRNGStream(seeds[[i -
> 1]])
>
> - send these, along with the job, to the workers, setting .Random.seed on
> each worker
>
>     bpmapply(function(i, seed, ...) {
>         oseed <- get(".Random.seed", envir = .GlobalEnv)
>         on.exit(assign(".Random.seed", oseed, envir = .GlobalEnv))
>         assign(".Random.seed", seed, envir = .GlobalEnv)
>         ...
>     }, seq_len(n), seeds, ...)
>
> The use of L'Ecuyer-CMRG and `nextRNGStream()` means that the streams on
> each worker are independent. Using on.exit means that, even on the worker,
> the state of the random number generator is not changed by the evaluation.
> This means that even with SerialParam() the generator is well-behaved. I
> don’t know how BiocCheck responds to use of .Random.seed, which in general
> would be a bad thing to do but in this case with the use of on.exit() the
> usage seems ok.
>
> Martin
>
>
> On 12/31/18, 3:17 PM, "Lulu Chen" <luluchen using vt.edu> wrote:
>
>     Hi Martin,
>
>
>     Thanks for your help. But setting different number of workers will
> generate different results:
>
>
>     > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(1, RNGseed=123)))
>      [1]  1.0654274 -1.2421454  1.0523311 -0.7744536  1.3081934
> -1.5305223  1.1525356  0.9287607 -0.4355877  1.5055436
>     > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(2, RNGseed=123)))
>      [1] -0.9685927  0.7061091  1.4890213 -0.4094454  0.8909694
> -0.8653704  1.4642711  1.2674845 -0.2220491  2.4505322
>     > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(3, RNGseed=123)))
>      [1] -0.96859273 -0.40944544  0.89096942 -0.86537045  1.46427111
> 1.26748453 -0.48906078  0.43304237 -0.03195349
>     [10]  0.14670372
>     > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(4, RNGseed=123)))
>      [1] -0.96859273 -0.40944544  0.89096942 -0.48906078  0.43304237
> -0.03195349 -1.03886641  1.57451249  0.74708204
>     [10]  0.67187201
>
>
>
>     Best,
>     Lulu
>
>
>
>     On Mon, Dec 31, 2018 at 1:12 PM Martin Morgan <mtmorgan.bioc using gmail.com>
> wrote:
>
>
>     The major BiocParallel objects (SnowParam(), MulticoreParam()) and use
> of bplapply() allow fully repeatable randomizations, e.g.,
>
>     > library(BiocParallel)
>     > unlist(bplapply(1:4, rnorm, BPPARAM=MulticoreParam(RNGseed=123)))
>      [1] -0.96859273 -0.40944544  0.89096942 -0.48906078  0.43304237
> -0.03195349
>      [7] -1.03886641  1.57451249  0.74708204  0.67187201
>     > unlist(bplapply(1:4, rnorm, BPPARAM=MulticoreParam(RNGseed=123)))
>      [1] -0.96859273 -0.40944544  0.89096942 -0.48906078  0.43304237
> -0.03195349
>      [7] -1.03886641  1.57451249  0.74708204  0.67187201
>     > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(RNGseed=123)))
>     [1] -0.96859273 -0.40944544  0.89096942 -0.48906078  0.43304237
> -0.03195349
>      [7] -1.03886641  1.57451249  0.74708204  0.67187201
>
>     The idea then would be to tell the user to register() such a param, or
> to write your function to accept an argument rngSeed along the lines of
>
>     f = function(..., rngSeed = NULL) {
>         if (!is.null(rngSeed)) {
>             param = bpparam()  # user's preferred back-end
>             oseed = bpRNGseed(param)
>             on.exit(bpRNGseed(param) <- oseed)
>             bpRNGseed(param) = rngSeed
>         }
>         bplapply(1:4, rnorm)
>     }
>
>     (actually, this exercise illustrates a problem with bpRNGseed<-() when
> the original seed is NULL; this will be fixed in the next day or so...)
>
>     Is that sufficient for your use case?
>
>     On 12/31/18, 11:24 AM, "Bioc-devel on behalf of Lulu Chen" <
> bioc-devel-bounces using r-project.org on behalf of
>     luluchen using vt.edu> wrote:
>
>         Dear all,
>
>         I posted the question in the Bioconductor support site (
>
>     https://support.bioconductor.org/p/116381/ <
> https://support.bioconductor.org/p/116381/>) and was suggested to direct
>         future correspondence there.
>
>         I plan to generate a vector of seeds (provided by users through
> argument of
>         my R function) and use them by set.seed() in each parallel
> computation.
>         However, set.seed() will cause warning in BiocCheck().
>
>         Someone suggested to re-write code using c++, which is a good
> idea. But it
>         will take me much more extra time to re-write some functions from
> other
>         packages, e.g. eBayes() in limma.
>
>         Hope to get more suggestions from you. Thanks a lot!
>
>         Best,
>         Lulu
>
>             [[alternative HTML version deleted]]
>
>         _______________________________________________
>         Bioc-devel using r-project.org mailing list
>
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel <
> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>
>
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list