[Bioc-devel] Issue with change in random sampling

Fri Sep 24 21:23:06 CEST 2021

The random number generator is tracked by a global state, which is a
violation of functional programming, and this is unavoidable. This is why we
really don't want packages to EVER touch the random number state, for
example by setting the seed. You can also be affected if any function you
depend on starts calling the random number generator, because every such
call advances the shared state.
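
As a rough sketch of how one could check whether a given call advances the
global state, the base R snippet below compares .Random.seed before and
after the call. The helper names get_rng_state() and touches_rng() are made
up for illustration and are not part of any package.

```r
# Return the current global RNG state, or NULL if no random numbers
# have been drawn (and no seed has been set) in this session yet.
get_rng_state <- function() {
  if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    get(".Random.seed", envir = globalenv(), inherits = FALSE)
  } else {
    NULL
  }
}

# TRUE if calling f() changed the global RNG state.
# (Hypothetical helper, for illustration only.)
touches_rng <- function(f) {
  before <- get_rng_state()
  f()
  !identical(before, get_rng_state())
}

touches_rng(function() sum(1:10))  # no random numbers are drawn here
touches_rng(function() runif(1))   # one random number is drawn here
```

Wrapping a suspect call from a dependency in touches_rng() should show
whether that call moves the state that later calls to sample() rely on.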

For the example you describe here, a more robust approach would be to call
set.seed() just prior to calling sample().
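
A minimal sketch of that idea, assuming made-up cluster labels and an
arbitrary second seed (101) rather than the real objects from the test
setup. The runif() call only stands in for upstream code that consumes
random numbers.

```r
# Hypothetical stand-in for the real cluster labels in the test setup.
cluster_ids <- rep(c("a", "b", "c"), each = 5)

subsample <- function(n_upstream_draws) {
  set.seed(23)
  # Stand-in for upstream code (e.g. a dependency) consuming random numbers.
  invisible(runif(n_upstream_draws))
  # Re-seed immediately before the draw that the tests depend on.
  set.seed(101)
  unlist(tapply(seq_along(cluster_ids), cluster_ids,
                function(x) sample(x, size = 3)))
}

# The subsample is the same no matter how many numbers were consumed upstream.
identical(subsample(0), subsample(1000))  # TRUE
```

Because the seed is reset right before sample(), whatever happened to the
random number state earlier in the setup file no longer matters for this
draw.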

Although the cause is different, I have been bitten by this kind of test
failure when packages I depend upon have changed certain numerical routines.
In theory, I think that's a good feature since it alerts the developer, who
can then decide whether the dependency should change (I once dropped a
package from a dependency list because of this). In practice, however, such
tests can become EXTREMELY irritating when you get a failure out of the blue
without any idea why. I don't have any good solutions to this.

In this case, I think Martin alerted the community to the changes; it is
just easy to miss such announcements. A potential improvement would be the
option to attach a "message-of-the-day" to build reports to alert
developers to possible changes. This could include messages such as "we
know Windows is failing right now because we are maintaining the build
servers". If anyone entertains this, we would want the ability to update
the message in real time and not just when the build system runs. It might
not be too hard to just link to a common text file, though.

Best,
Kasper

On Fri, Sep 24, 2021 at 2:16 PM Martin Morgan <mtmorgan.bioc using gmail.com>
wrote:

> This does sound like a BiocParallel side effect, and I would suggest
> holding off for another week or so for the BiocParallel changes to be
> finalized.
> On 9/24/21, 2:05 PM, "Bioc-devel" <bioc-devel-bounces using r-project.org>
> wrote:
>
> Hello,
>
> My package `clusterExperiment` has not changed but is hitting errors on the
> devel branch. I’ve pinpointed it to the fact that a small dataset I am
> running the tests on is randomly subsetted from a larger subset and is no
> longer choosing the same observations. I have already, in a previous version,
> corrected the tests for the change in random number generation in R.4.0.x.
> I am wondering if it is related to the changes in BiocParallel
> (https://community-bioc.slack.com/archives/CEQ04GKEC/p1631903391030800?thread_ts=1631881095.027600&cid=CEQ04GKEC).
>
> It was unexpected for me that this would affect these results. My package
> doesn’t use BiocParallel or depend on it. But it turns out the code in
> question does make a call to BiocSingular to run a PCA, and BiocSingular
> does make calls to BiocParallel. What is strange to me is that even if I
> don’t directly use the results of runPCA, but simply make the call to
> runPCA before running the code in question, the output of that code is
> changed. So this seems to me to indicate that the sequence of random
> numbers is being globally affected by the change, and not just internally
> to the results of calls to BiocParallel. I didn’t realize this was the case
> from the above discussion — I thought it would only affect output that
> directly relied on calls to BiocParallel — and I was hoping someone could
> confirm that this is what is happening and/or give me an explicit way to check
> whether this is the source of my errors.
>
> Here’s the basic setup. I have a setup file that sets up a lot of objects
> for my tests (setup_create_objects.R). The relevant parts look something
> like this (I’ve simplified it from what’s actually in the file so it more
> clearly shows the progression):
>
> data(simData)
> suppressWarnings(RNGversion("3.5.0"))
> set.seed(23)
>
> … # bunch of code
>
> clusterIds<- … # code that internally calls BiocSingular::runPCA
>
> … # bunch of code
>
> ### sample 3 observations from each cluster:
>
> whSamp <- unlist(tapply(seq_len(ncol(simData)), clusterIds,
>                         function(x) sample(x = x, size = 3)))
> smSimData <- simData[1:20, whSamp]
>
> This results in different values of clusterIds and thus different whSamp on
> the release and the devel version.
>
> The unexpected part was even if I add a line that manually overwrites
> clusterIds to be the values of the vector `clusterIds` from the release
> version (copied manually from running on a different computer that is not
> the devel version) I don’t get the same result of whSamp (I still run the
> code for `clusterIds`, so BiocSingular::runPCA is still being called). If,
> however, in addition to manually feeding the correct clusterIds on the devel
> version, I ALSO put in a new call to `set.seed` in the line before computing
> whSamp, then both the devel and the release version give the same result, as I
> would expect. This makes me think that the random seed has been
> affected globally. Further, the second entry of .Random.seed is not the
> same after running setup_create_objects.R on the devel version as the new
> version.
>
> Thanks,
> Elizabeth Purdom
>
>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
