[Bioc-devel] Issue with change in random sampling

Elizabeth Purdom epurdom @end|ng |rom berke|ey@edu
Fri Sep 24 20:05:03 CEST 2021


Hello,

My package `clusterExperiment` has not changed but is hitting errors on the devel branch. I’ve pinpointed it to the fact that a small dataset I run the tests on is randomly subsetted from a larger dataset and is no longer choosing the same observations. I had already corrected the tests in a previous version for the change in random number generation in R 4.0.x. I am wondering if it is related to the changes in BiocParallel (https://community-bioc.slack.com/archives/CEQ04GKEC/p1631903391030800?thread_ts=1631881095.027600&cid=CEQ04GKEC).

It was unexpected for me that this would affect these results. My package doesn’t use BiocParallel directly or depend on it. But it turns out the code in question does make a call to BiocSingular to run a PCA, and BiocSingular does make calls to BiocParallel. What is strange to me is that even if I don’t directly use the results of runPCA, but simply make the call to runPCA before running the code in question, the output of that code is changed. This seems to indicate that the change affects the global sequence of random numbers, and not just the results returned by calls to BiocParallel. I didn’t realize this was the case from the above discussion (I thought it would only affect output that directly relied on calls to BiocParallel), and I was hoping someone could confirm that this is what is happening and/or give me an explicit way to check that this is the source of my errors.
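
One possible way to check whether a given call touches the global random number stream is to compare `.Random.seed` before and after it. The following is only a minimal base-R sketch: `rng_state` and `changes_rng` are hypothetical helper names, not part of BiocParallel or BiocSingular, and `runif(1)` stands in for whatever call is under suspicion (for example, the call that reaches runPCA).

```r
# Hypothetical helper: return the current global RNG state,
# or NULL if no random numbers have been drawn or seeded yet.
rng_state <- function() {
  if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    get(".Random.seed", envir = globalenv(), inherits = FALSE)
  } else {
    NULL
  }
}

# Hypothetical helper: TRUE if evaluating `expr` changes the global RNG state.
changes_rng <- function(expr) {
  before <- rng_state()
  force(expr)  # `expr` is a lazy promise; it is evaluated here
  after <- rng_state()
  !identical(before, after)
}

set.seed(23)
changes_rng(mean(1:10))  # deterministic code: does not consume random numbers
changes_rng(runif(1))    # stand-in for a call that consumes random numbers
```

If wrapping the suspect call in `changes_rng()` returns TRUE on devel but FALSE on release, that would suggest the call is advancing the global stream only on devel.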

Here’s the basic setup. I have a setup file that sets up a lot of objects for my tests (setup_create_objects.R). The relevant parts look something like this (I’ve simplified it from what’s actually in the file so it more clearly shows the progression):

data(simData)
suppressWarnings(RNGversion("3.5.0"))
set.seed(23)

… # bunch of code

clusterIds<- … # code that internally calls BiocSingular::runPCA

… # bunch of code

### sample 3 observations from each cluster:
whSamp <- unlist(tapply(seq_len(ncol(simData)), clusterIds, function(x) { sample(x = x, size = 3) }))
smSimData <- simData[1:20, whSamp]

This results in different values of clusterIds and thus different whSamp on the release and the devel version.

The unexpected part was that even if I add a line that manually overwrites clusterIds with the values of the vector `clusterIds` from the release version (copied manually from a run on a different computer that does not have the devel version), I don’t get the same result for whSamp (I still run the code for `clusterIds`, so BiocSingular::runPCA is still being called). If, however, in addition to manually feeding in the correct clusterIds on the devel version, I ALSO put in a new call to `set.seed` in the line before computing whSamp, then both the devel and the release version give the same result, as I would expect. This makes me think that the random seed has been affected globally. Further, the second entry of .Random.seed after running setup_create_objects.R is not the same on the devel version as on the release version.
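
As an alternative to re-seeding before the sampling step, the global RNG state could be saved before the intermediate call and restored afterwards, so that later draws do not depend on how many random numbers that call consumed. This is only a base-R sketch under stated assumptions: `with_preserved_seed` is a hypothetical helper, and `runif(10)` stands in for the code that internally calls BiocSingular::runPCA.

```r
# Hypothetical helper: evaluate `expr`, then put the global RNG state
# back to what it was before the evaluation.
with_preserved_seed <- function(expr) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_seed) {
    old_seed <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  }
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  })
  expr  # evaluated here; the on.exit handler runs after the value is computed
}

# Reference draw with nothing in between.
set.seed(23)
a <- sample(100, 3)

# Same seed, but with an RNG-consuming call wrapped in the helper first.
set.seed(23)
invisible(with_preserved_seed(runif(10)))  # stand-in for the PCA step
b <- sample(100, 3)

identical(a, b)
```

Note that this shields only the draws made after the wrapped call; any random numbers used inside the wrapped call itself would still follow whatever stream the installed package versions produce.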

Thanks,
Elizabeth Purdom


