[Bioc-devel] how to achieve reproducibility with BiocParallel regardless of number of threads and OS (set.seed is disallowed)
Lulu Chen
luluchen @ending from vt@edu
Thu Jan 3 17:38:52 CET 2019
Thanks for teaching me how to set a seed for each job!
On Wed, Jan 2, 2019 at 9:45 AM Martin Morgan <mtmorgan.bioc using gmail.com>
wrote:
> I'll back-track on my advice a little, and say that the right way to
> enable the user to get reproducible results is to respect the setting the
> user makes outside your function. So for
>
> your = function()
> unlist(bplapply(1:4, rnorm))
>
> The user will
>
> register(MulticoreParam(2, RNGseed=123))
> your()
>
> to always produces the identical result.
>
> Following Aaron's strategy, the R-level approach to reproducibility might
> be along the lines of
>
> - tell the user to set parallel::RNGkind("L'Ecuyer-CMRG") and set.seed()
> - In your function, generate seeds for each job
>
> n = 5; seeds <- vector("list", n)
> seeds[[1]] = .Random.seed # FIXME fails if set.seed or random nos.
> have not been generated...
> for (i in tail(seq_len(n), -1)) seeds[[i]] = nextRNGStream(seeds[[i -
> 1]])
>
> - send these, along with the job, to the workers, setting .Random.seed on
> each worker
>
> bpmapply(function(i, seed, ...) {
> oseed <- get(".Random.seed", envir = .GlobalEnv)
> on.exit(assign(".Random.seed", oseed, envir = .GlobalEnv))
> assign(".Random.seed", seed, envir = .GlobalEnv)
> ...
> }, seq_len(n), seeds, ...)
>
> The use of L'Ecuyer-CMRG and `nextRNGStream()` means that the streams on
> each worker are independent. Using on.exit means that, even on the worker,
> the state of the random number generator is not changed by the evaluation.
> This means that even with SerialParam() the generator is well-behaved. I
> don’t know how BiocCheck responds to use of .Random.seed, which in general
> would be a bad thing to do but in this case with the use of on.exit() the
> usage seems ok.
>
> Martin
>
>
> On 12/31/18, 3:17 PM, "Lulu Chen" <luluchen using vt.edu> wrote:
>
> Hi Martin,
>
>
> Thanks for your help. But setting different number of workers will
> generate different results:
>
>
> > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(1, RNGseed=123)))
> [1] 1.0654274 -1.2421454 1.0523311 -0.7744536 1.3081934
> -1.5305223 1.1525356 0.9287607 -0.4355877 1.5055436
> > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(2, RNGseed=123)))
> [1] -0.9685927 0.7061091 1.4890213 -0.4094454 0.8909694
> -0.8653704 1.4642711 1.2674845 -0.2220491 2.4505322
> > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(3, RNGseed=123)))
> [1] -0.96859273 -0.40944544 0.89096942 -0.86537045 1.46427111
> 1.26748453 -0.48906078 0.43304237 -0.03195349
> [10] 0.14670372
> > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(4, RNGseed=123)))
> [1] -0.96859273 -0.40944544 0.89096942 -0.48906078 0.43304237
> -0.03195349 -1.03886641 1.57451249 0.74708204
> [10] 0.67187201
>
>
>
> Best,
> Lulu
>
>
>
> On Mon, Dec 31, 2018 at 1:12 PM Martin Morgan <mtmorgan.bioc using gmail.com>
> wrote:
>
>
> The major BiocParallel objects (SnowParam(), MulticoreParam()) and use
> of bplapply() allow fully repeatable randomizations, e.g.,
>
> > library(BiocParallel)
> > unlist(bplapply(1:4, rnorm, BPPARAM=MulticoreParam(RNGseed=123)))
> [1] -0.96859273 -0.40944544 0.89096942 -0.48906078 0.43304237
> -0.03195349
> [7] -1.03886641 1.57451249 0.74708204 0.67187201
> > unlist(bplapply(1:4, rnorm, BPPARAM=MulticoreParam(RNGseed=123)))
> [1] -0.96859273 -0.40944544 0.89096942 -0.48906078 0.43304237
> -0.03195349
> [7] -1.03886641 1.57451249 0.74708204 0.67187201
> > unlist(bplapply(1:4, rnorm, BPPARAM=SnowParam(RNGseed=123)))
> [1] -0.96859273 -0.40944544 0.89096942 -0.48906078 0.43304237
> -0.03195349
> [7] -1.03886641 1.57451249 0.74708204 0.67187201
>
> The idea then would be to tell the user to register() such a param, or
> to write your function to accept an argument rngSeed along the lines of
>
> f = function(..., rngSeed = NULL) {
> if (!is.null(rngSeed)) {
> param = bpparam() # user's preferred back-end
> oseed = bpRNGseed(param)
> on.exit(bpRNGseed(param) <- oseed)
> bpRNGseed(param) = rngSeed
> }
> bplapply(1:4, rnorm)
> }
>
> (actually, this exercise illustrates a problem with bpRNGseed<-() when
> the original seed is NULL; this will be fixed in the next day or so...)
>
> Is that sufficient for your use case?
>
> On 12/31/18, 11:24 AM, "Bioc-devel on behalf of Lulu Chen" <
> bioc-devel-bounces using r-project.org on behalf of
> luluchen using vt.edu> wrote:
>
> Dear all,
>
> I posted the question in the Bioconductor support site (
>
> https://support.bioconductor.org/p/116381/ <
> https://support.bioconductor.org/p/116381/>) and was suggested to direct
> future correspondence there.
>
> I plan to generate a vector of seeds (provided by users through
> argument of
> my R function) and use them by set.seed() in each parallel
> computation.
> However, set.seed() will cause warning in BiocCheck().
>
> Someone suggested to re-write code using c++, which is a good
> idea. But it
> will take me much more extra time to re-write some functions from
> other
> packages, e.g. eBayes() in limma.
>
> Hope to get more suggestions from you. Thanks a lot!
>
> Best,
> Lulu
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
>
> https://stat.ethz.ch/mailman/listinfo/bioc-devel <
> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>
>
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list