[Bioc-devel] Using SerialParam() as the registered back-end for all platforms
Aaron Lun
infinite.monkeys.with.keyboards using gmail.com
Tue Jan 8 09:41:58 CET 2019
Good point Kasper, I hadn’t even considered that possibility. This has opened an unpleasant can of worms...
I haven’t noticed any issues with MT, but I’ve just switched my DropletUtils C++ code from boost::random::mt19937 to dqrng’s pcg32, both to avoid potential problems with overlaps between streams and to avoid MT seeding issues (a 32-bit seed for a 19937-bit state).
-A
> On 8 Jan 2019, at 03:45, Kasper Daniel Hansen <kasperdanielhansen using gmail.com> wrote:
>
> To add to Henrik's comments, it is also worthwhile to recognize that
> mclapply() does not deliver statistically sound random numbers even within
> the apply "loop" unless you use RNGkind("L'Ecuyer-CMRG"), which is not the
> default. This is because mclapply will initialize random streams with
> different seeds, and most RNGs do not guarantee that different seeds lead
> to independent streams (but L'Ecuyer does).
>
> Reproducibility is nice, but so is statistical correctness.
>
> Best,
> Kasper
>
>
> On Mon, Jan 7, 2019 at 6:26 PM Henrik Bengtsson <henrik.bengtsson using gmail.com>
> wrote:
>
>> On Mon, Jan 7, 2019 at 4:09 AM Martin Morgan <mtmorgan.bioc using gmail.com>
>> wrote:
>>>
>>> For 1., I hope to have a 'local socket' (i.e., not using ports)
>> implementation shortly.
>>>
>>> I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We
>> now have
>>>
>>>> library(BiocParallel)
>>>> set.seed(1); p = bpparam(); rnorm(1)
>>> [1] -0.6264538
>>>> set.seed(1); p = bpparam(); rnorm(1)
>>> [1] -0.6264538
>>>
>>> at the expense of using the generator when the package is loaded.
>>>
>>>> set.seed(1); rnorm(1)
>>> [1] -0.6264538
>>>> set.seed(1); library(BiocParallel); rnorm(1)
>>> [1] 0.1836433
>>>
>>> Is that bad? It will be consistent across platforms.
>>
>> Contrary to resampling methods etc., the RNG for generating random
>> ports does _not_ have to be statistically sound. Because of this, you
>> should be able to generate a random port while *preserving the RNG
>> state*, i.e. undoing .GlobalEnv$.Random.seed (including removing it if
>> in a fresh R session). I decided to add this approach for
>> future::makeClusterPSOCK()
>> [
>> https://github.com/HenrikBengtsson/future/blob/073e1d1/R/makeClusterPSOCK.R#L76-L82
>> ].
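>>
>> Not the actual makeClusterPSOCK() code, just a minimal sketch of the
>> idea (the helper name is made up):
>>
>> with_preserved_seed <- function(expr) {
>>   env <- globalenv()
>>   had_seed <- exists(".Random.seed", envir = env, inherits = FALSE)
>>   if (had_seed) old <- get(".Random.seed", envir = env, inherits = FALSE)
>>   on.exit({
>>     if (had_seed) {
>>       assign(".Random.seed", old, envir = env)
>>     } else if (exists(".Random.seed", envir = env, inherits = FALSE)) {
>>       rm(".Random.seed", envir = env)  # back to the fresh-session state
>>     }
>>   })
>>   expr
>> }
>>
>> port <- with_preserved_seed(sample(11000:11999, size = 1L))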
>>
>> BTW, here's a useful development tool for detecting when the
>> .Random.seed has been updated in an interactive session:
>>
>> addTaskCallback(local({
>>   last <- .GlobalEnv$.Random.seed
>>   function(...) {
>>     curr <- .GlobalEnv$.Random.seed
>>     if (!identical(curr, last)) {
>>       warning(".Random.seed changed", call. = FALSE, immediate. = TRUE)
>>       last <<- curr
>>     }
>>     TRUE
>>   }
>> }), name = "RNG tracker")
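>>
>> (It can be removed again with removeTaskCallback("RNG tracker").)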
>>
>>>
>>> This behavior
>>>
>>>> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
>>> [1] 0.9624337 0.8925947
>>>> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
>>> [1] -0.5703597 0.1102093
>>>
>>> seems wrong, but is consistent with mclapply
>>>
>>>> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
>>> [1] -0.02704527 0.40721777
>>>> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
>>> [1] -0.8239765 1.2957928
>>
>> FWIW, without having invested much time tracking down why the design
>> is what it is, I'm getting concerned about how parallel::mclapply()
>> does parallel RNG out of the box when RNGkind("L'Ecuyer-CMRG") is in
>> use. For example:
>>
>>> library(parallel)
>>> RNGkind("L'Ecuyer-CMRG")
>>> unlist(mclapply(1:2, function(i) rnorm(1))) ## mc.set.seed = TRUE
>> [1] -0.71727 0.56171
>>> unlist(mclapply(1:2, function(i) rnorm(1)))
>> [1] -0.71727 0.56171
>>> unlist(mclapply(1:2, function(i) rnorm(1)))
>> [1] -0.71727 0.56171
>>
>> I wonder how many resampling/bootstrap studies are bitten by this
>> behavior. The same happens if you use mc.set.seed = FALSE:
>>
>>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
>> [1] -0.8518311 -0.8518311
>>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
>> [1] -0.8518311 -0.8518311
>>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
>> [1] -0.8518311 -0.8518311
>>
>>
>> Here is how I think about parallel RNG right now:
>>
>> 1. To achieve fully numerically reproducible RNGs in a way that is
>> *invariant to the number of workers* (amount of chunking), I think the
>> only solution is to pregenerate RNG seeds (using
>> parallel::nextRNGStream()) for each individual iteration (element).
>> In other words, if a worker will process K elements, then the main R
>> process needs to generate K RNG seeds and pass those along to the
>> worker. I use this approach for future.apply::future_lapply(...,
>> future.seed = TRUE/<initial_seed>), which then produces identical RNG
>> results regardless of backend and amount of chunking. In the past, I
>> think I've seen Martin suggest something similar as a manual
>> approach to some users.
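>>
>> A rough sketch of that per-element scheme (not the actual future.apply
>> implementation; it assumes RNGkind("L'Ecuyer-CMRG") and uses mclapply()
>> only for illustration):
>>
>> library(parallel)
>> RNGkind("L'Ecuyer-CMRG")
>> set.seed(42)
>> X <- 1:8
>> ## Pre-generate one .Random.seed per element, independent of chunking.
>> seeds <- vector("list", length(X))
>> s <- .Random.seed
>> for (i in seq_along(X)) {
>>   seeds[[i]] <- s
>>   s <- nextRNGStream(s)
>> }
>> res <- mclapply(seq_along(X), function(i) {
>>   assign(".Random.seed", seeds[[i]], envir = globalenv())
>>   rnorm(1)
>> }, mc.cores = 2, mc.set.seed = FALSE)
>>
>> The result is the same for any mc.cores, because each element carries
>> its own pre-assigned stream.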
>>
>> 2. The above approach is obviously expensive, especially when there
>> are a large number of elements to iterate over. Because of this, I'm
>> thinking of providing an option to use only one RNG seed per worker
>> (which is the common approach used elsewhere)
>> [https://github.com/HenrikBengtsson/future.apply/issues/20]. This
>> won't be invariant to the number of workers, but it "should" still be
>> statistically sound. This approach will give reproducible RNG results
>> given the same initial seed and the same amount of chunking.
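>>
>> For comparison, a sketch of the stock one-stream-per-worker approach in
>> base R (reproducible for a fixed cluster size, but the values shift if
>> the number of workers changes):
>>
>> library(parallel)
>> cl <- makeCluster(2)
>> clusterSetRNGStream(cl, iseed = 42)  # one L'Ecuyer stream per worker
>> unlist(parLapply(cl, 1:4, function(i) rnorm(1)))
>> stopCluster(cl)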
>>
>> 3. For algorithms which do not rely on RNG, we can ignore both of the
>> above. The problem is that it's not always known to the
>> user/developer which methods depend on RNG or not. The above 'RNG
>> tracker' helps to identify some, but things might also change over
>> time. I believe there's room for automating this in one way or
>> another. For instance, having a way to declare whether a function
>> depends on RNG could help. Static code inspection could also do it,
>> e.g. when an R package is built, and it could become part of the R
>> CMD check validation.
>>
>> 4. Are there other approaches?
>>
>> /Henrik
>>
>>>
>>> The documented behavior is to use the RNGseed= argument to *Param, but I
>> think it could be made consistent (by default, obey the global random
>> number seed on workers) at least on a single machine (where the default
>> number of cores is constant).
>>>
>>> I have not (yet?) changed the default behavior to SerialParam. I guess
>> the cost of SerialParam is from the dependent packages that need to be
>> loaded
>>>
>>>> system.time(suppressPackageStartupMessages(library(DelayedArray)))
>>> user system elapsed
>>> 3.068 0.082 3.150
>>>
>>> If fastMNN() makes several calls to bplapply(), it might make sense to
>> start the default cluster at the top of the function once
>>>
>>> if (!bpisup(bpparam())) {
>>>     bpstart(bpparam())
>>>     on.exit(bpstop(bpparam()))
>>> }
>>>
>>> Martin
>>>
>>> On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <
>> bioc-devel-bounces using r-project.org on behalf of
>> infinite.monkeys.with.keyboards using gmail.com> wrote:
>>>
>>> As we know, the default BiocParallel backends are currently set to
>> MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to
>> some extent because a new user running, say, bplapply() without additional
>> arguments or set-up would expect some kind of parallelization. However,
>> from a developer’s perspective, I would argue that it makes more sense to
>> use SerialParam() by default.
>>>
>>> 1. It avoids problems with MulticoreParam stalling (especially on
>> Macs) when the randomly chosen port is already in use. This used to be a
>> major problem, to the point that all my BiocParallel-using functions in
>> scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as
>> package default would ensure BiocParallel functions run properly in the
>> first place; if the code stalls due to switching to MulticoreParam, then
>> it’s obvious where the problem lies (and how to fix it).
>>>
>>> 2. It avoids the alteration of the random seed when the
>> MulticoreParam instance is constructed for the first time.
>>>
>>> library(BiocParallel) # new R session
>>> set.seed(100)
>>> invisible(bplapply(1:5, identity))
>>> rnorm(1) # 0.1315312
>>> set.seed(100)
>>> invisible(bplapply(1:5, identity))
>>> rnorm(1) # -0.5021924
>>>
>>> This is because the first bplapply() call calls bpparam(), which
>> constructs a MulticoreParam() for the first time; this calls the PRNG to
>> choose a random port number. Ensuing random numbers are altered, as seen
>> above. To avoid this, I need to define the MulticoreParam() object prior to
>> set.seed(), which undermines the utility of a default-defined bpparam().
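>>>
>>>     In sketch form (the same workaround, with the param constructed
>>>     before set.seed() and passed explicitly):
>>>
>>>     library(BiocParallel) # new R session
>>>     p <- MulticoreParam() # port is drawn here, before set.seed()
>>>     set.seed(100)
>>>     invisible(bplapply(1:5, identity, BPPARAM = p))
>>>     rnorm(1) # no longer affected by the port draw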
>>>
>>> 3. Job dispatch via SnowParam() is quite slow, which potentially
>> makes Windows package builds run slower by default. A particularly bad
>> example is that of scran::fastMNN(), which has a few matrix multiplications
>> that use DelayedArray:%*%. The %*% is parallelized with the default
>> bpparam(), resulting in SNOW parallelization on Windows. This slowed down
>> fastMNN()’s examples from 4 seconds (unix) to ~100 seconds (windows).
>> Clearly, serial execution is the faster option here. A related problem is
>> MulticoreParam()’s tendency to copy the environment, which may result in
>> problems from inflated memory consumption.
>>>
>>> So, can we default to SerialParam() on all platforms? And by this I
>> mean the BiocParallel in-built default - I don’t want to have to instruct
>> all my users to put a “register(SerialParam())” at the start of their
>> analysis scripts. I feel that BiocParallel’s job is to provide downstream
>> code with the potential for parallelization. If end-users want actual
>> parallelization, they had better be prepared to specify an appropriate
>> scheme via *Param() objects.
>>>
>>> -A
>>>
>>>
>>>
>>>
>>
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel