[Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
Serguei Sokol
sokol at insa-toulouse.fr
Fri Nov 3 15:56:38 CET 2017
Le 03/11/2017 à 14:24, Tirthankar Chakravarty a écrit :
> Martin,
>
> Thanks for the helpful reply. Alas I had forgotten that (implied)
> unfavorable comparisons of *nix systems with Windows systems would likely
> draw irate (but always substantive) responses on the R-devel list -- poor
> phrasing on my part. :)
>
> Regardless, let me try to address some of the concerns related to the
> construction of the MRE itself and try to see if we can clean away the
> shrubbery & zero down on the core issue, since I continue to believe that
> this is an issue with either R's implementation or a bad interaction of the
> seeds supplied with the Mersenne-Twister algorithm itself.
Is there an issue or not may depend on how the vector 'seeds' was obtained.
If we simply do:
r=range(seeds)
s=seq(r[1], r[2])
# pick up seeds giving the runif() in (25; 26) interval
s25=s[sapply(s, function(i) {set.seed(i); runif(1, 17, 26) > 25})]
all(seeds %in% s25) # TRUE
length(s25)/diff(r) # 0.1107351
Thus, the proportion of such seeds is about 1/9 which is coherent with
the fraction of the interval (25; 26) in (17; 26).
Now, you can pick up any 21 numbers from s25 vector (which is 48631 long) and say
"Look! It's weird, all values drawn by runif() are > 25!"
But s25 has nothing strange by itself. If we plot kind of cumulative distribution
plot(s25, type="l")
It shows a distribution very close to uniform which means that such seeds
are not grouped more densely or rarely somewhere.
So, how your set of seeds was obtained?
Best,
Serguei.
> The latter would
> require a deeper understanding of the algorithm than I have at the moment.
> If we can rule out the former through this thread, then I will pursue the
> latter solution path.
>
> Responses inline below, but summarizing:
>
> 1. All examples now are run using "R CMD BATCH --vanilla" as you have
> suggested, to ensure that no other loaded packages or namespace changes
> have interfered with the behaviour of `set.seed`.
> 2. Converting the character vector to integer vector has no impact on the
> output.
> 3. Upgrading to the latest version of R has no impact on the output.
> 4. Multiplying the seed vector by 10L causes the behaviour to vanish,
> calling into question the large integer theory.
>
>
> On Fri, Nov 3, 2017 at 3:09 PM, Martin Maechler <maechler at stat.math.ethz.ch>
> wrote:
>
>> Why R-devel -- R-help would have been appropriate:
>>
>> It seems you have not read the help page for
>> set.seed as I expect it from posters to R-devel.
>> Why would you use strings instead of integers if you *had* read it ?
>>
> The manual (which we did read) says:
>
> seed a single value, interpreted as an integer,
>
> We were confident of R coercing characters to integers correctly. We
> tested, prior to making this posting that the behaviour remains intact if
> we change the `seeds` variable from a character vector to the "equivalent"
> integer vector by hand.
>
>> seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L,
> 86584272L,
> + 86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
> + 86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
> + 86901193L, 86987847L, 86988080L)
>> random_values = sapply(seeds, function(x) {
> + set.seed(x)
> + y = runif(1, 17, 26)
> + return(y)
> + })
>> summary(random_values)
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 25.13 25.36 25.66 25.58 25.83 25.94
>
>
>
>> > We are facing a weird situation in our code when using R's
>> > [`runif`][1] and setting seed with `set.seed` with the
>> > `kind = NULL` option (which resolves, unless I am
>> > mistaken, to `kind = "default"`; the default being
>> > `"Mersenne-Twister"`).
>>
>> again this is not what the help page says; rather
>>
>> | The use of ‘kind = NULL’ or ‘normal.kind = NULL’ in ‘RNGkind’ or
>> | ‘set.seed’ selects the currently-used generator (including that
>> | used in the previous session if the workspace has been restored):
>> | if no generator has been used it selects ‘"default"’.
>>
>> but as you have > 90 (!!) packages in your sessionInfo() below,
>> why should we (or you) know if some of the things you did
>> before or (implicitly) during loading all these packages did not
>> change the RNG kind ?
>>
> Agreed. We are running this system in production, and we will need
> `set.seed` to behave reliably with this session, however, as you say, we
> are claiming that there is an issue with the PRNG, so should isolate to an
> environment that does not have any of the attendant potential confounding
> factors that come with having 90 packages loaded (did you count?).
>
> As mentioned above, we have rerun all examples using "R CMD BATCH
> --vanilla" and we can report that the output is unchanged.
>
>
>> > We set the seed using (8 digit) unique IDs generated by an
>> > upstream system, before calling `runif`:
>>
>> > seeds = c( "86548915", "86551615", "86566163",
>> > "86577411", "86584144", "86584272", "86620568",
>> > "86724613", "86756002", "86768593", "86772411",
>> > "86781516", "86794389", "86805854", "86814600",
>> > "86835092", "86874179", "86876466", "86901193",
>> > "86987847", "86988080")
>>
>> > random_values = sapply(seeds, function(x) {
>> > set.seed(x)
>> > y = runif(1, 17, 26)
>> > return(y)
>> > })
>>
>> Why do you do that?
>>
>> 1) You should set the seed *once*, not multiple times in one simulation.
>>
> This code is written like this since this seed is set every time the
> function (API) is called for call-level replicability. It doesn't make a
> lot of sense in an MRE, but this is a critical component of the larger
> function. We do acknowledge that for any one of the seeds in the vector
> `seeds` the vector of draws appears to have the uniform distribution.
>
>
>> 2) Assuming that your strings are correctly translated to integers
>> and the same on all platforms, independent of locales (!) etc,
>> you are again not following the simple instruction on the help page:
>>
>> ‘set.seed’ uses a single integer argument to set as many seeds as
>> are required. It is intended as a simple way to get quite
>> different seeds by specifying small integer arguments, and also as
>> .....
>> .....
>>
>> Note: ** small ** integer
>> Why do you assume 86901193 to be a small integer ?
>>
> Because 86901193/2^32 = 0.02. What is a "small integer"?
>
>
>> > This gives values that are **extremely** bunched together.
>>
>> >> summary(random_values)
>> > Min. 1st Qu. Median Mean 3rd Qu. Max. 25.13
>> > 25.36 25.66 25.58 25.83 25.94
>>
>> > This behaviour of `runif` goes away when we use `kind =
>> > "Knuth-TAOCP-2002"`, and we get values that appear to be
>> > much more evenly spread out.
>>
>> > random_values = sapply(seeds, function(x) {
>> > set.seed(x, kind = "Knuth-TAOCP-2002") y = runif(1, 17,
>> > 26) return(y) })
>>
>> > *Output omitted.*
>>
>> > ---
>>
>> > **The most interesting thing here is that this does not
>> > happen on Windows -- only happens on Ubuntu**
>> > (`sessionInfo` output for Ubuntu & Windows below).
>>
>> > # Windows output: #
>>
>> >> seeds = c(
>> > + "86548915", "86551615", "86566163", "86577411",
>> > "86584144", + "86584272", "86620568", "86724613",
>> > "86756002", "86768593", "86772411", + "86781516",
>> > "86794389", "86805854", "86814600", "86835092",
>> > "86874179", + "86876466", "86901193", "86987847",
>> > "86988080")
>> >>
>> >> random_values = sapply(seeds, function(x) {
>> > + set.seed(x) + y = runif(1, 17, 26) + return(y) + })
>> >>
>> >> summary(random_values)
>> > Min. 1st Qu. Median Mean 3rd Qu. Max. 17.32
>> > 20.14 23.00 22.17 24.07 25.90
>>
>> > Can someone help understand what is going on?
>>
>> > Ubuntu
>> > ------
>>
>> > R version 3.4.0 (2017-04-21)
>> > Platform: x86_64-pc-linux-gnu (64-bit)
>> > Running under: Ubuntu 16.04.2 LTS
>>
>> You have not learned to get a current version of R.
>> ===> You should not write to R-devel (sorry if this may sound harsh ..)
>>
> We do spend a while on certain versions of R since upgrading our systems in
> production is not something we are able to do frequently & this version is
> only 6 months old. However, addressing your concern, upgrading to R 3.4.2
> leaves the output unchanged.
>
>
>> - - - - -
>> After doing all this, your problem may still be just
>> because you are using much too large integers for the 'seed'
>> argument of set.seed()
>>
> Note that multiplying the reported set of seeds by 10, results in expected
> output, so not clear if there is a sweet spot that bugs out the
> Mersenne-Twister algorithm:
>
> seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L, 86584272L,
> 86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
> 86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
> 86901193L, 86987847L, 86988080L)*10
>
> random_values = sapply(seeds, function(x) {
> set.seed(x)
> y = runif(1, 17, 26)
> return(y)
> })
>
> summary(random_values)
>
>
>
>> I really really strongly believe you should have used R-help
>> instead of R-devel.
>>
>> Best,
>> Martin Maechler
>>
> If you continue to believe with the inputs given in this reply that this
> should be on R-help, we will switch over.
>
> Your continued help would be appreciated in understanding the issue.
>
> T
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Serguei Sokol
Ingenieur de recherche INRA
Cellule mathématique
LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04
tel: +33 5 6155 9849
email: sokol at insa-toulouse.fr
http://www.lisbp.fr
More information about the R-devel
mailing list