[Bioc-devel] Using SerialParam() as the registered back-end for all platforms

Aaron Lun infinite.monkeys.with.keyboards at gmail.com
Tue Jan 8 09:41:58 CET 2019


Good point Kasper, I hadn’t even considered that possibility. This has opened an unpleasant can of worms...

I haven’t noticed any issues with MT, but I’ve just switched my DropletUtils C++ code from boost::random::mt19937 to dqrng’s pcg32 to avoid potential problems with overlaps between streams, and to avoid MT seeding issues (a 32-bit seed for a 19937-bit state).
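
At the R level, the same stream-per-worker idea looks roughly like this (a minimal sketch, assuming dqrng’s R-level interface dqRNGkind()/dqset.seed()/dqrunif(); the actual DropletUtils change is in the C++ code via dqrng’s bundled PCG headers):

    library(dqrng)
    dqRNGkind("pcg64")          # PCG family at the R level; the C++ side uses pcg32
    dqset.seed(42, stream = 1)  # worker 1: its own stream
    x1 <- dqrunif(5)
    dqset.seed(42, stream = 2)  # worker 2: same seed, different stream, so no overlap
    x2 <- dqrunif(5)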

-A

> On 8 Jan 2019, at 03:45, Kasper Daniel Hansen <kasperdanielhansen using gmail.com> wrote:
> 
> To add to Henrik's comments, it is also worthwhile to recognize that
> mclapply() does not deliver statistically sound random numbers even within
> the apply "loop" unless you use RNGkind("L'Ecuyer-CMRG"), which is not the
> default. This is because mclapply() will initialize random streams with
> different seeds, and most RNGs do not guarantee that different seeds lead
> to independent streams (but L'Ecuyer does).
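> 
> A minimal sketch of the statistically sound setup (nothing beyond the base
> parallel API):
> 
>    library(parallel)
>    RNGkind("L'Ecuyer-CMRG")  # streams designed to be independent
>    set.seed(123)
>    mc.reset.stream()         # seed the stream mclapply() uses for its children
>    res <- unlist(mclapply(1:4, function(i) rnorm(1)))  # mc.set.seed = TRUE by default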
> 
> Reproducibility is nice, but so is statistical correctness.
> 
> Best,
> Kasper
> 
> 
> On Mon, Jan 7, 2019 at 6:26 PM Henrik Bengtsson <henrik.bengtsson using gmail.com>
> wrote:
> 
>> On Mon, Jan 7, 2019 at 4:09 AM Martin Morgan <mtmorgan.bioc using gmail.com>
>> wrote:
>>> 
>>> I hope for 1. to have a 'local socket' (i.e., not using ports)
>> implementation shortly.
>>> 
>>> I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We
>> now have
>>> 
>>>> library(BiocParallel)
>>>> set.seed(1); p = bpparam(); rnorm(1)
>>> [1] -0.6264538
>>>> set.seed(1); p = bpparam(); rnorm(1)
>>> [1] -0.6264538
>>> 
>>> at the expense of using the generator when the package is loaded.
>>> 
>>>> set.seed(1); rnorm(1)
>>> [1] -0.6264538
>>>> set.seed(1); library(BiocParallel); rnorm(1)
>>> [1] 0.1836433
>>> 
>>> Is that bad? It will be consistent across platforms.
>> 
>> In contrast to resampling methods etc., the RNG used for generating random
>> ports does _not_ have to be statistically sound.  Because of this, you
>> should be able to generate a random port while *preserving the RNG
>> state*, i.e. undoing .GlobalEnv$.Random.seed (including removing it if in
>> a fresh R session).  I decided to take this approach for
>> future::makeClusterPSOCK()
>> [
>> https://github.com/HenrikBengtsson/future/blob/073e1d1/R/makeClusterPSOCK.R#L76-L82
>> ].
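>> 
>> The idea in a minimal sketch (random_port() is a hypothetical helper here,
>> not the actual makeClusterPSOCK() code):
>> 
>>   random_port <- function() {
>>     oseed <- .GlobalEnv$.Random.seed            # NULL in a fresh session
>>     on.exit({
>>       if (is.null(oseed)) {
>>         rm(".Random.seed", envir = .GlobalEnv)  # restore the 'no seed yet' state
>>       } else {
>>         assign(".Random.seed", oseed, envir = .GlobalEnv)
>>       }
>>     })
>>     sample(11000:11999, size = 1L)              # drawing a port updates .Random.seed; undone on exit
>>   }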
>> 
>> BTW, here's a useful development tool for detecting when the
>> .Random.seed has been updated in an interactive session:
>> 
>> addTaskCallback(local({
>>   last <- .GlobalEnv$.Random.seed   # RNG state when the tracker is installed
>>   function(...) {
>>     curr <- .GlobalEnv$.Random.seed
>>     if (!identical(curr, last)) {
>>       warning(".Random.seed changed", call. = FALSE, immediate. = TRUE)
>>       last <<- curr                 # remember the new state
>>     }
>>     TRUE                            # keep the callback registered
>>   }
>> }), name = "RNG tracker")
>> 
>>> 
>>> This behavior
>>> 
>>>> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
>>> [1] 0.9624337 0.8925947
>>>> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
>>> [1] -0.5703597  0.1102093
>>> 
>>> seems wrong, but is consistent with mclapply
>>> 
>>>> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
>>> [1] -0.02704527  0.40721777
>>>> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
>>> [1] -0.8239765  1.2957928
>> 
>> FWIW, without having invested much time tracking down why the design
>> is what it is, I'm getting concerned about how parallel::mclapply()
>> does parallel RNG out of the box when RNGkind("L'Ecuyer-CMRG") is in
>> use.  For example:
>> 
>>> library(parallel)
>>> RNGkind("L'Ecuyer-CMRG")
>>> unlist(mclapply(1:2, function(i) rnorm(1)))  ## mc.set.seed = TRUE
>> [1] -0.71727  0.56171
>>> unlist(mclapply(1:2, function(i) rnorm(1)))
>> [1] -0.71727  0.56171
>>> unlist(mclapply(1:2, function(i) rnorm(1)))
>> [1] -0.71727  0.56171
>> 
>> I wonder how many resampling/bootstrap studies are bitten by this
>> behavior.  The same happens if you use mc.set.seed = FALSE:
>> 
>>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
>> [1] -0.8518311 -0.8518311
>>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
>> [1] -0.8518311 -0.8518311
>>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
>> [1] -0.8518311 -0.8518311
>> 
>> 
>> Here is how I think about parallel RNG right now:
>> 
>> 1. To achieve fully numerically reproducible RNG in a way that is
>> *invariant to the number of workers* (amount of chunking), I think the
>> only solution is to pregenerate RNG seeds (using
>> parallel::nextRNGStream()) for each individual iteration (element).
>> In other words, if a worker will process K elements, then the main R
>> process needs to generate K RNG seeds and pass those along to the
>> worker.  I use this approach for future.apply::future_lapply(...,
>> future.seed = TRUE/<initial_seed>), which then produces identical RNG
>> results regardless of backend and amount of chunking.  In the past, I
>> think I've seen Martin suggest something similar as a manual
>> approach to some users.
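>> 
>> A minimal sketch of this per-element approach (plain lapply() stands in
>> for whatever backend actually does the work):
>> 
>>   RNGkind("L'Ecuyer-CMRG")
>>   set.seed(42)
>>   X <- 1:8
>>   seeds <- vector("list", length(X))
>>   seeds[[1]] <- .Random.seed
>>   for (i in seq_along(X)[-1]) seeds[[i]] <- parallel::nextRNGStream(seeds[[i - 1]])
>>   res <- lapply(seq_along(X), function(i) {
>>     assign(".Random.seed", seeds[[i]], envir = .GlobalEnv)  # element i always uses seed i
>>     rnorm(1)
>>   })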
>> 
>> 2. The above approach is obviously expensive, especially when there
>> are a large number of elements to iterate over.  Because of this, I'm
>> thinking of providing an option to use only one RNG seed per worker
>> (which is the common approach used elsewhere)
>> [https://github.com/HenrikBengtsson/future.apply/issues/20].  This
>> won't be invariant to the number of workers, but it "should" still be
>> statistically sound.  This approach will give reproducible RNG results
>> given the same initial seed and the same amount of chunking.
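>> 
>> For SNOW-style clusters, base R already ships this per-worker variant
>> (again, just a sketch):
>> 
>>   library(parallel)
>>   cl <- makeCluster(2)
>>   clusterSetRNGStream(cl, iseed = 42)  # one L'Ecuyer-CMRG stream per worker
>>   res <- unlist(parLapply(cl, 1:4, function(i) rnorm(1)))
>>   stopCluster(cl)
>> 
>> The results depend on how the four elements are split across the two
>> workers, i.e. on the chunking, as noted above.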
>> 
>> 3. For algorithms that do not rely on RNG, we can ignore both of the
>> above.  The problem is that it's not always known to the
>> user/developer whether a given method depends on RNG.  The above 'RNG
>> tracker' helps to identify some cases, but things might also change over
>> time.  I believe there's room for automating this in one way or
>> another.  For instance, having a way to declare whether a function
>> depends on RNG could help.  Static code inspection could also do it,
>> e.g. when an R package is built, and the result could be validated as
>> part of R CMD check.
>> 
>> 4. Are there other approaches?
>> 
>> /Henrik
>> 
>>> 
>>> The documented behavior is to use the RNGseed= argument to *Param, but I
>> think it could be made consistent (by default, obey the global random
>> number seed on workers) at least on a single machine (where the default
>> number of cores is constant).
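>>> 
>>> For example, a minimal sketch of the documented route:
>>> 
>>>    library(BiocParallel)
>>>    p <- SnowParam(workers = 2, RNGseed = 42)
>>>    unlist(bplapply(1:2, function(i) rnorm(1), BPPARAM = p))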
>>> 
>>> I have not (yet?) changed the default behavior to SerialParam. I guess
>> the cost (relative to SerialParam) comes from the dependent packages that
>> need to be loaded on each worker
>>> 
>>>> system.time(suppressPackageStartupMessages(library(DelayedArray)))
>>>   user  system elapsed
>>>  3.068   0.082   3.150
>>> 
>>> If fastMNN() makes several calls to bplapply(), it might make sense to
>> start the default cluster at the top of the function once
>>> 
>>>    if (!bpisup(bpparam())) {
>>>        bpstart(bpparam())
>>>        on.exit(bpstop(bpparam()))
>>>    }
>>> 
>>> Martin
>>> 
>>> On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <
>> bioc-devel-bounces using r-project.org on behalf of
>> infinite.monkeys.with.keyboards using gmail.com> wrote:
>>> 
>>>    As we know, the default BiocParallel backends are currently set to
>> MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to
>> some extent because a new user running, say, bplapply() without additional
>> arguments or set-up would expect some kind of parallelization. However,
>> from a developer’s perspective, I would argue that it makes more sense to
>> use SerialParam() by default.
>>> 
>>>    1. It avoids problems with MulticoreParam stalling (especially on
>> Macs) when the randomly chosen port is already in use. This used to be a
>> major problem, to the point that all my BiocParallel-using functions in
>> scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as
>> package default would ensure BiocParallel functions run properly in the
>> first place; if the code stalls due to switching to MulticoreParam, then
>> it’s obvious where the problem lies (and how to fix it).
>>> 
>>>    2. It avoids the alteration of the random seed when the
>> MulticoreParam instance is constructed for the first time.
>>> 
>>>    library(BiocParallel) # new R session
>>>    set.seed(100)
>>>    invisible(bplapply(1:5, identity))
>>>    rnorm(1) # 0.1315312
>>>    set.seed(100)
>>>    invisible(bplapply(1:5, identity))
>>>    rnorm(1) # -0.5021924
>>> 
>>>    This is because the first bplapply() call invokes bpparam(), which
>> constructs a MulticoreParam() for the first time; this draws from the PRNG to
>> choose a random port number, so subsequent random numbers are altered, as seen
>> above. To avoid this, I need to define the MulticoreParam() object prior to
>> set.seed(), which undermines the utility of a default-defined bpparam().
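>>> 
>>>    A workaround, for comparison (a minimal sketch: register the back-end,
>>>    or at least construct it, before touching the seed):
>>> 
>>>    library(BiocParallel) # new R session
>>>    register(SerialParam()) # or construct the MulticoreParam() here, before set.seed()
>>>    set.seed(100)
>>>    invisible(bplapply(1:5, identity))
>>>    rnorm(1) # identical on every repeat, since bpparam() no longer draws a port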
>>> 
>>>    3. Job dispatch via SnowParam() is quite slow, which potentially
>> makes Windows package builds run slower by default. A particularly bad
>> example is that of scran::fastMNN(), which has a few matrix multiplications
>> that use DelayedArray's %*%. This %*% is parallelized with the default
>> bpparam(), resulting in SNOW parallelization on Windows. This slowed down
>> fastMNN()’s examples from 4 seconds (unix) to ~100 seconds (windows).
>> Clearly, serial execution is the faster option here. A related problem is
>> MulticoreParam()’s tendency to copy the environment, which may result in
>> problems from inflated memory consumption.
>>> 
>>>    So, can we default to SerialParam() on all platforms? And by this I
>> mean the BiocParallel in-built default - I don’t want to have to instruct
>> all my users to put a “register(SerialParam())” at the start of their
>> analysis scripts. I feel that BiocParallel’s job is to provide downstream
>> code with the potential for parallelization. If end-users want actual
>> parallelization, they had better be prepared to specify an appropriate
>> scheme via *Param() objects.
>>> 
>>>    -A
>>> 
>>> 
>>> 
>>> 