[R-sig-hpc] doParallel: RNG not reproducible

Frank Weber (tu-dortmund.de)
Fri Feb 7 07:43:19 CET 2020


Hi Cristian,

thanks a lot for your reply. I was aware of the "doRNG" package, but since
it supports nested foreach() loops only via a workaround, I preferred to
stick with the original "doParallel" package. Besides, I posted this issue
because I thought it might be worth a bug fix (or a new feature) in
"doParallel". The non-reproducibility is subtle (it occurs only
occasionally, apparently when tasks end up on different workers than in a
previous run), so I thought it might not have come to the package authors'
attention yet. (The doRNG package's vignette uses set.seed() to
demonstrate that results aren't reproducible, so I thought it worth
showing that the non-reproducibility also occurs with
clusterSetRNGStream().)

Nevertheless, if this is a known issue and a bug fix (or a new feature)
for "doParallel" is not planned, I might have to use the workaround for
nested loops shown in the vignette of "doRNG".
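
For reference, a minimal sketch of one way to decouple the random numbers
from the task-to-worker allocation while staying with "doParallel": create
one L'Ecuyer-CMRG stream per task (rather than per worker, as
clusterSetRNGStream() does) and install that stream at the start of each
iteration. The names n_workers, n_tasks, make_task_seeds() and run_once()
are made up for illustration. This is only a sketch under these
assumptions, not a tested replacement for doRNG, and it does not address
nested loops.

```r
library(doParallel)

n_workers <- 2L   # hypothetical worker count, kept small for illustration
n_tasks <- 12L    # same number of iterations as in the original example

cl_obj <- makeCluster(n_workers)
registerDoParallel(cl_obj)

# Build one L'Ecuyer-CMRG stream seed per task (not per worker).
# Note: this switches the RNG kind of the master session as a side effect.
make_task_seeds <- function(n, iseed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", n)
  seeds[[1L]] <- .Random.seed
  for (i in seq_len(n)[-1L]) {
    # Each call jumps ahead to the start of the next independent stream.
    seeds[[i]] <- parallel::nextRNGStream(seeds[[i - 1L]])
  }
  seeds
}

task_seeds <- make_task_seeds(n_tasks, iseed = 2373632L)

run_once <- function() {
  foreach(task_seed = task_seeds, .combine = "cbind") %dopar% {
    # Install this task's stream on whichever worker happens to run it.
    assign(".Random.seed", task_seed, envir = globalenv())
    c(runif(1), rnorm(1))
  }
}

res1 <- run_once()
res2 <- run_once()

stopCluster(cl_obj)
```

With this setup, the random numbers drawn in iteration i depend only on
task_seeds[[i]], so res1 and res2 should be identical no matter which
worker picks up which iteration.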

Thanks again and best regards,
Frank Weber

> Hi Frank,
>
> Please have a look at the doRNG package.
>
> https://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
>
> Regards,
> Cristian
>
>
> Cristian Bologa, Ph.D.
> Research Professor,
> Div. of Translational Informatics,
> Dept. of Internal Medicine,
> Univ. of New Mexico, School of Medicine,
> Innovation Discovery&Training Center, MSC09 5025,
> 700 Camino de Salud NE, Albuquerque, NM 87131
> Phone: +1 (505) 925-7534
> Fax:+1 (505) 925-7625
> --------------------------
> "True (artificial) intelligence is not the ability to give an answer, but
> to ask the right question"
>
>
>
> -----Original Message-----
> From: R-sig-hpc [mailto:r-sig-hpc-bounces at r-project.org] On Behalf Of
> Frank Weber
> Sent: Thursday, February 06, 2020 3:00 AM
> To: r-sig-hpc at r-project.org
> Subject: [R-sig-hpc] doParallel: RNG not reproducible
>
> Hi everyone,
>
> I am uncertain how to correctly set up the package "doParallel" for
> getting reproducible results in random number generation (RNG). If I run
> the following code repeatedly in a fresh R session, then at some point,
> the stopifnot() check produces an error (indicating the results have
> changed):
>
> ### Start R code
> library(doParallel)
>
> n_slaves <- 8L
> cl_obj <- makeCluster(n_slaves)
> registerDoParallel(cl_obj)
> clusterSetRNGStream(cl_obj, iseed = 2373632L)
>
> rng_res <- foreach(
>   icount(as.integer(n_slaves + floor(n_slaves / 2))),
>   .combine = "cbind"
> ) %dopar% {
>   c(runif(1), rnorm(1))
> }
> if (!file.exists("rng_res.rds")) {
>   saveRDS(rng_res, file = "rng_res.rds")
> } else {
>   rng_res_old <- readRDS(file = "rng_res.rds")
>   stopifnot(identical(rng_res, rng_res_old))
> }
> ### End R code
>
> When inspecting the results in detail (between two runs with differing
> results), it seems that the allocation of computational tasks (i.e. loop
> iterations) to cluster workers is swapped. For example, in one run, I get:
>
> ### Start output
>       result.1   result.2  result.3   result.4   result.5  result.6
> [1,] 0.8720487  0.4791119 0.7671285  0.2306335  0.2470827 0.7042595
> [2,] 1.3970093 -2.1914685 0.2847861 -2.1083101 -1.0850567 0.1582748
>        result.7  result.8  result.9 result.10  result.11   result.12
> [1,]  0.2103175 0.6149857 0.2153797 0.5944501  0.1431205  0.53079192
> [2,] -1.2820137 0.2153303 0.9401810 0.5049244 -1.1084520 -0.05597698
> ### End output
>
> and in another run, I get:
>
> ### Start output
>       result.1   result.2  result.3   result.4   result.5  result.6
> [1,] 0.8720487  0.4791119 0.7671285  0.2306335  0.2470827 0.7042595
> [2,] 1.3970093 -2.1914685 0.2847861 -2.1083101 -1.0850567 0.1582748
>        result.7  result.8  result.9 result.10   result.11  result.12
> [1,]  0.2103175 0.6149857 0.2153797 0.5944501  0.53079192  0.1431205
> [2,] -1.2820137 0.2153303 0.9401810 0.5049244 -0.05597698 -1.1084520
> ### End output
>
> As one can see, columns 11 and 12 are swapped. Thus, it seems to me that
> the allocation of computational tasks to cluster workers is not fixed. In
> the package "doMPI", the documentation states that this fixation is
> handled by argument "defaultopts$seed" in startMPIcluster(). Is there a
> similar function/argument/option in "doParallel"? According to the
> documentation of "doParallel", such a function/argument/option does not
> exist. But then, how do I get reproducible results in "doParallel"?
>
> My sessionInfo():
>
> ### Start output
> R version 3.6.2 (2019-12-12)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 18363)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] doParallel_1.0.15 iterators_1.0.12  foreach_1.4.7
>
> loaded via a namespace (and not attached):
> [1] compiler_3.6.2   tools_3.6.2      codetools_0.2-16
> ### End output
>
> Note: I am using RStudio. Perhaps this might be important.
>
> Thanks in advance and best regards,
> Frank Weber
>