[Bioc-devel] Using SerialParam() as the registered back-end for all platforms

Kasper Daniel Hansen k@@perd@nielh@n@en @ending from gm@il@com
Tue Jan 8 04:45:44 CET 2019


To add to Henrik's comments, it is also worthwhile to recognize that
mclapply() does not deliver statistically sound random numbers even within
the apply "loop" unless you use RNGkind("L'Ecuyer-CMRG") which is not set
as default. This is because mclapply will initialize random streams with
different seeds and most RNGs does not guarantee that different seeds leads
to independent streams (but L'Ecuyer does).

Reproducibility is nice, but so is statistical correctness.

Best,
Kasper


On Mon, Jan 7, 2019 at 6:26 PM Henrik Bengtsson <henrik.bengtsson using gmail.com>
wrote:

> On Mon, Jan 7, 2019 at 4:09 AM Martin Morgan <mtmorgan.bioc using gmail.com>
> wrote:
> >
> > I hope for 1. to have a 'local socket' (i.e., not using ports)
> implementation shortly.
> >
> > I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We
> now have
> >
> > > library(BiocParallel)
> > > set.seed(1); p = bpparam(); rnorm(1)
> > [1] -0.6264538
> > > set.seed(1); p = bpparam(); rnorm(1)
> > [1] -0.6264538
> >
> > at the expensive of using the generator when the package is loaded.
> >
> > > set.seed(1); rnorm(1)
> > [1] -0.6264538
> > > set.seed(1); library(BiocParallel); rnorm(1)
> > [1] 0.1836433
> >
> > Is that bad? It will be consistent across platforms.
>
> Contrary to resampling methods etc., the RNG for generating random
> ports does _not_ have to be statistically sound.  Because of this, you
> should be able to generate a random port while *preserving the RNG
> state*, i.e. undoing .GlobalEnv$.Random.seed (including removing if in
> a fresh R session).   I decided to add this approach for
> future::makeClusterPSOCK()
> [
> https://github.com/HenrikBengtsson/future/blob/073e1d1/R/makeClusterPSOCK.R#L76-L82
> ].
>
> BTW, here's a useful development tool for detecting when the
> .Random.seed has been updated in an interactive session:
>
> addTaskCallback(local({
>   last <- .GlobalEnv$.Random.seed
>   function(...) {
>     curr <- .GlobalEnv$.Random.seed
>     if (!identical(curr, last)) {
>       warning(".Random.seed changed", call. = FALSE, immediate. = TRUE)
>       last <<- curr
>     }
>     TRUE
>   }
> }), name = "RNG tracker")
>
> >
> > This behavior
> >
> > > set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
> > [1] 0.9624337 0.8925947
> > > set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
> > [1] -0.5703597  0.1102093
> >
> > seems wrong, but is consistent with mclapply
> >
> > > set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
> > [1] -0.02704527  0.40721777
> > > set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
> > [1] -0.8239765  1.2957928
>
> FWIW, without having invested much time tracking down why the design
> is what it is, I'm getting concerned about how parallel::mclapply()
> does parallel RNG out of the box when RNGkind("L'Ecuyer-CMRG") is in
> use.  For example:
>
> > library(parallel)
> > RNGkind("L'Ecuyer-CMRG")
> > unlist(mclapply(1:2, function(i) rnorm(1)))  ## mc.set.seed = TRUE
> [1] -0.71727  0.56171
> > unlist(mclapply(1:2, function(i) rnorm(1)))
> [1] -0.71727  0.56171
> > unlist(mclapply(1:2, function(i) rnorm(1)))
> [1] -0.71727  0.56171
>
> I wonder how many resampling/bootstrap studies are bitten by this
> behavior.  Same if you use mc.set.seed = FALSE;
>
> > unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
> [1] -0.8518311 -0.8518311
> > unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
> [1] -0.8518311 -0.8518311
> > unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
> [1] -0.8518311 -0.8518311
>
>
> Here is how I think about parallel RNG right now:
>
> 1. To achieve fully numerically reproducible RNGs in way that is
> *invariant to the number of workers* (amount of chunking), I think the
> only solution is to pregenerated RNG seeds (using
> parallel::nextRNGStream()) for each individual iteration (element).
> In other words, if a worker will process K elements, then the main R
> process needs to generate K RNG seeds and pass those along to the
> work.  I use this approach for future.apply::future_lapply(...,
> future.seed = TRUE/<initial_seed>), which then produce identical RNG
> results regardless of backend and amount of chunking.  In the past, I
> think I've seen Martin suggesting something similar as a manual
> approach to some users.
>
> 2. The above approach is obviously expensive, especially when there
> are a large number of elements to iterate over.  Because of this I'm
> thinking providing an option to use only one RNG seed per worker
> (which is the common approach used elsewhere)
> [https://github.com/HenrikBengtsson/future.apply/issues/20].  This
> won't be invariant to the number of workers, but it "should" still be
> statistically sound.  This approach will give reproducible RNG results
> given the same initial seed and the same amount of chunking.
>
> 3. For algorithms which do not rely on RNG, we can ignore both of the
> above.  The problem is that it's not always known to the
> user/developer which methods depend on RNG or not.  The above 'RNG
> tracker' helps to identify some, but things might also change over
> time.  I believe there's room for automating this in one way or the
> other.  For instance, having a way to declare a function being
> dependent on RNG or not could help.  Static code inspection could also
> do it, e.g. when an R package is built and it could be part of the R
> CMD checks to validate.
>
> 4. Are there other approaches?
>
> /Henrik
>
> >
> > The documented behavior is to us the RNGseed= argument to *Param, but I
> think it could be made consistent (by default, obey the global random
> number seed on workers) at least on a single machine (where the default
> number of cores is constant).
> >
> > I have not (yet?) changed the default behavior to SerialParam. I guess
> the cost of SerialParam is from the dependent packages that need to be
> loaded
> >
> > > system.time(suppressPackageStartupMessages(library(DelayedArray)))
> >    user  system elapsed
> >   3.068   0.082   3.150
> >
> > If fastMNN() makes several calls to bplapply(), it might make sense to
> start the default cluster at the top of the function once
> >
> >     if (!isup(bpparam())) {
> >         bpstart(bpparam())
> >         on.exit(bpstop(bpparam()))
> >     }
> >
> > Martin
> >
> > On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <
> bioc-devel-bounces using r-project.org on behalf of
> infinite.monkeys.with.keyboards using gmail.com> wrote:
> >
> >     As we know, the default BiocParallel backends are currently set to
> MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to
> some extent because a new user running, say, bplapply() without additional
> arguments or set-up would expect some kind of parallelization. However,
> from a developer’s perspective, I would argue that it makes more sense to
> use SerialParam() by default.
> >
> >     1. It avoids problems with MulticoreParam stalling (especially on
> Macs) when the randomly chosen port is in already use. This used to be a
> major problem, to the point that all my BiocParallel-using functions in
> scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as
> package default would ensure BiocParallel functions run properly in the
> first place; if the code stalls due to switching to MulticoreParam, then
> it’s obvious where the problem lies (and how to fix it).
> >
> >     2. It avoids the alteration of the random seed when the
> MulticoreParam instance is constructed for the first time.
> >
> >     library(BiocParallel) # new R session
> >     set.seed(100)
> >     invisible(bplapply(1:5, identity))
> >     rnorm(1) # 0.1315312
> >     set.seed(100)
> >     invisible(bplapply(1:5, identity))
> >     rnorm(1) # -0.5021924
> >
> >     This is because the first bplapply() call calls bpparam(), which
> constructs a MulticoreParam() for the first time; this calls the PRNG to
> choose a random port number. Ensuing random numbers are altered, as seen
> above. To avoid this, I need to define the MulticoreParam() object prior to
> set.seed(), which undermines the utility of a default-defined bpparam().
> >
> >     3. Job dispatch via SnowParam() is quite slow, which potentially
> makes Windows package builds run slower by default. A particularly bad
> example is that of scran::fastMNN(), which has a few matrix multiplications
> that use DelayedArray:%*%. The %*% is parallelized with the default
> bpparam(), resulting in SNOW parallelization on Windows. This slowed down
> fastMNN()’s examples from 4 seconds (unix) to ~100 seconds (windows).
> Clearly, serial execution is the faster option here. A related problem is
> MulticoreParam()’s tendency to copy the environment, which may result in
> problems from inflated memory consumption.
> >
> >     So, can we default to SerialParam() on all platforms? And by this I
> mean the BiocParallel in-built default - I don’t want to have to instruct
> all my users to put a “register(SerialParam())” at the start of their
> analysis scripts. I feel that BiocParallel’s job is to provide downstream
> code with the potential for parallelization. If end-users want actual
> parallelization, they had better be prepared to specify an appropriate
> scheme via *Param() objects.
> >
> >     -A
> >
> >
> >
> >
> >         [[alternative HTML version deleted]]
> >
> >     _______________________________________________
> >     Bioc-devel using r-project.org mailing list
> >     https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
> > _______________________________________________
> > Bioc-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list