[Bioc-devel] Using SerialParam() as the registered back-end for all platforms

Ryan Thompson rct @ending from thomp@oncl@n@org
Mon Jan 7 16:03:39 CET 2019


I don't know if this is helpful for BiocParallel, but there's an extension
for the foreach package that ensures reproducible RNG behavior for all
parallel backends: https://cran.r-project.org/web/packages/doRNG/index.html

Perhaps some of the principles from that package can be re-used?

On Mon, Jan 7, 2019 at 9:37 AM Aaron Lun <
infinite.monkeys.with.keyboards using gmail.com> wrote:

> > I hope for 1. to have a 'local socket' (i.e., not using ports)
> implementation shortly.
>
> Yes, that would be helpful.
>
> > I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We
> now have
> >
> >> library(BiocParallel)
> >> set.seed(1); p = bpparam(); rnorm(1)
> > [1] -0.6264538
> >> set.seed(1); p = bpparam(); rnorm(1)
> > [1] -0.6264538
> >
> > at the expensive of using the generator when the package is loaded.
> >
> >> set.seed(1); rnorm(1)
> > [1] -0.6264538
> >> set.seed(1); library(BiocParallel); rnorm(1)
> > [1] 0.1836433
> >
> > Is that bad? It will be consistent across platforms.
>
> Hm. I guess the changed behaviour is… better, in the sense that the second
> scenario (setting the seed before loading the package) is less likely in
> real analysis code.
>
> Even so, there are probably some edge cases where this could cause issues,
> e.g., when:
>
> set.seed(1)
> MyPackage::SomeFun()
>
> … where MyPackage causes BiocParallel to be attached, which presumably
> changes the seed.
>
> Having thought about it for a while, the fact that bpparam() changes the
> random seed is only a secondary issue. The main issue is that it doesn’t
> change the seed reproducibly. So I wouldn’t mind *as much* if repeated
> calls of:
>
> set.seed(1)
> bpparam()
> rnorm(1)
>
> … gave the same result, even if it were different from just running
> “set.seed(1); rnorm(1)”. (Mind you, I’d still mind a little, but it
> wouldn’t be so bad.) The biggest problem with the current state of affairs
> is that the first call gives different results to all subsequent calls,
> which really interferes with debugging attempts.
>
> > This behavior
> >
> >> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
> > [1] 0.9624337 0.8925947
> >> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
> > [1] -0.5703597  0.1102093
> >
> > seems wrong, but is consistent with mclapply
> >
> >> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
> > [1] -0.02704527  0.40721777
> >> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
> > [1] -0.8239765  1.2957928
> >
> > The documented behavior is to us the RNGseed= argument to *Param, but I
> think it could be made consistent (by default, obey the global random
> number seed on workers) at least on a single machine (where the default
> number of cores is constant).
>
> I’m less concerned with that behaviour, given it’s inherently hard to take
> randomization code written for serial execution and make it give the same
> results on multiple cores (as we discussed elsewhere).
>
> > I have not (yet?) changed the default behavior to SerialParam. I guess
> the cost of SerialParam is from the dependent packages that need to be
> loaded
> >
> >> system.time(suppressPackageStartupMessages(library(DelayedArray)))
> >   user  system elapsed
> >  3.068   0.082   3.150
>
> Does calling "SerialParam()” cause DelayedArray to be attached? That seems
> odd.
>
> > If fastMNN() makes several calls to bplapply(), it might make sense to
> start the default cluster at the top of the function once
> >
> >    if (!isup(bpparam())) {
> >        bpstart(bpparam())
> >        on.exit(bpstop(bpparam()))
> >    }
>
> This is probably a good idea to do in general to all of my parallelized
> functions, though I don’t know how much this will solve the time problem.
> Perhaps I should just do it and see.
>
> -A
>
> > On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <
> bioc-devel-bounces using r-project.org on behalf of
> infinite.monkeys.with.keyboards using gmail.com> wrote:
> >
> >    As we know, the default BiocParallel backends are currently set to
> MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to
> some extent because a new user running, say, bplapply() without additional
> arguments or set-up would expect some kind of parallelization. However,
> from a developer’s perspective, I would argue that it makes more sense to
> use SerialParam() by default.
> >
> >    1. It avoids problems with MulticoreParam stalling (especially on
> Macs) when the randomly chosen port is in already use. This used to be a
> major problem, to the point that all my BiocParallel-using functions in
> scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as
> package default would ensure BiocParallel functions run properly in the
> first place; if the code stalls due to switching to MulticoreParam, then
> it’s obvious where the problem lies (and how to fix it).
> >
> >    2. It avoids the alteration of the random seed when the
> MulticoreParam instance is constructed for the first time.
> >
> >    library(BiocParallel) # new R session
> >    set.seed(100)
> >    invisible(bplapply(1:5, identity))
> >    rnorm(1) # 0.1315312
> >    set.seed(100)
> >    invisible(bplapply(1:5, identity))
> >    rnorm(1) # -0.5021924
> >
> >    This is because the first bplapply() call calls bpparam(), which
> constructs a MulticoreParam() for the first time; this calls the PRNG to
> choose a random port number. Ensuing random numbers are altered, as seen
> above. To avoid this, I need to define the MulticoreParam() object prior to
> set.seed(), which undermines the utility of a default-defined bpparam().
> >
> >    3. Job dispatch via SnowParam() is quite slow, which potentially
> makes Windows package builds run slower by default. A particularly bad
> example is that of scran::fastMNN(), which has a few matrix multiplications
> that use DelayedArray:%*%. The %*% is parallelized with the default
> bpparam(), resulting in SNOW parallelization on Windows. This slowed down
> fastMNN()’s examples from 4 seconds (unix) to ~100 seconds (windows).
> Clearly, serial execution is the faster option here. A related problem is
> MulticoreParam()’s tendency to copy the environment, which may result in
> problems from inflated memory consumption.
> >
> >    So, can we default to SerialParam() on all platforms? And by this I
> mean the BiocParallel in-built default - I don’t want to have to instruct
> all my users to put a “register(SerialParam())” at the start of their
> analysis scripts. I feel that BiocParallel’s job is to provide downstream
> code with the potential for parallelization. If end-users want actual
> parallelization, they had better be prepared to specify an appropriate
> scheme via *Param() objects.
> >
> >    -A
> >
> >
> >
> >
> >       [[alternative HTML version deleted]]
> >
> >    _______________________________________________
> >    Bioc-devel using r-project.org mailing list
> >    https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list