[Rd] Extreme bunching of random values from runif with Mersenne-Twister seed

Tirthankar Chakravarty tirthankar.lists at gmail.com
Fri Nov 3 14:24:23 CET 2017


Martin,

Thanks for the helpful reply. Alas I had forgotten that (implied)
unfavorable comparisons of *nix systems with Windows systems would likely
draw irate (but always substantive) responses on the R-devel list -- poor
phrasing on my part. :)

Regardless, let me try to address the concerns about the construction of
the MRE itself, so that we can clear away the shrubbery and zero in on the
core issue. I continue to believe that this is an issue either with R's
implementation or with a bad interaction between the supplied seeds and the
Mersenne-Twister algorithm itself. The latter would require a deeper
understanding of the algorithm than I have at the moment. If we can rule
out the former through this thread, then I will pursue the latter.

Responses inline below, but summarizing:

1. All examples now are run using "R CMD BATCH --vanilla" as you have
suggested, to ensure that no other loaded packages or namespace changes
have interfered with the behaviour of `set.seed`.
2. Converting the character vector to an integer vector has no impact on
the output.
3. Upgrading to the latest version of R has no impact on the output.
4. Multiplying the seed vector by 10 makes the behaviour vanish, which
calls the large-integer theory into question.


On Fri, Nov 3, 2017 at 3:09 PM, Martin Maechler <maechler at stat.math.ethz.ch>
wrote:

> Why R-devel -- R-help would have been appropriate:
>

> It seems you have not read the help page for
> set.seed as I expect it from posters to R-devel.
> Why would you use strings instead of integers if you *had* read it ?
>

The manual (which we did read) says:

seed a single value, interpreted as an integer,

We were confident that R coerces characters to integers correctly. Before
posting, we verified that the behaviour is unchanged if we replace the
character vector `seeds` with the "equivalent" integer vector by hand.

> seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L,
+   86584272L,
+   86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
+   86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
+   86901193L, 86987847L, 86988080L)
>
> random_values = sapply(seeds, function(x) {
+   set.seed(x)
+   y = runif(1, 17, 26)
+   return(y)
+ })
>
> summary(random_values)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  25.13   25.36   25.66   25.58   25.83   25.94
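
For illustration, a minimal sketch of the coercion check described above. It
uses only the first four seeds from the post; the names `seeds_chr`,
`seeds_int` and `first_draw` are hypothetical and do not appear in the
original code. It assumes `set.seed` accepts a character seed, as the
original MRE does.

```r
# Hypothetical names; a subset of the seeds from the post.
seeds_chr <- c("86548915", "86551615", "86566163", "86577411")
seeds_int <- c(86548915L, 86551615L, 86566163L, 86577411L)

# Coercion check: do the strings map to exactly the hand-typed integers?
coercion_ok <- identical(as.integer(seeds_chr), seeds_int)

# First uniform draw on [17, 26] after seeding with a given value.
first_draw <- function(seed) {
  set.seed(seed)
  runif(1, 17, 26)
}

# USE.NAMES = FALSE stops vapply from naming the result after the strings,
# so the two result vectors can be compared with identical().
draws_chr <- vapply(seeds_chr, first_draw, numeric(1), USE.NAMES = FALSE)
draws_int <- vapply(seeds_int, first_draw, numeric(1))

same_draws <- identical(draws_chr, draws_int)
print(c(coercion_ok = coercion_ok, same_draws = same_draws))
```

If both flags are TRUE, the character-versus-integer question can be set
aside as a cause of the bunching.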



>     > We are facing a weird situation in our code when using R's
>     > [`runif`][1] and setting seed with `set.seed` with the
>     > `kind = NULL` option (which resolves, unless I am
>     > mistaken, to `kind = "default"`; the default being
>     > `"Mersenne-Twister"`).
>
> again this is not what the help page says; rather
>
>  | The use of ‘kind = NULL’ or ‘normal.kind = NULL’ in ‘RNGkind’ or
>  | ‘set.seed’ selects the currently-used generator (including that
>  | used in the previous session if the workspace has been restored):
>  | if no generator has been used it selects ‘"default"’.
>
> but as you have > 90 (!!) packages in your sessionInfo() below,
> why should we (or you) know if some of the things you did
> before or (implicitly) during loading all these packages did not
> change the RNG kind ?
>

Agreed. We run this system in production, and we need `set.seed` to behave
reliably in that session. However, as you say, since we are claiming that
there is an issue with the PRNG, we should isolate it in an environment
free of the potential confounding factors that come with having 90
packages loaded (did you count?).

As mentioned above, we have rerun all examples using "R CMD BATCH
--vanilla" and we can report that the output is unchanged.
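
One way to remove any doubt about which generator is active, sketched here
as an assumption rather than as what the production code does, is to
inspect `RNGkind()` and to pin the kind explicitly in `set.seed`. The
helper name `first_draw_mt` is hypothetical.

```r
# Report the generator currently in use. Martin's concern is that earlier
# code or loaded packages may have changed it from the default.
kind_before <- RNGkind()
print(kind_before)

# Hypothetical helper: seed with an explicitly named generator so that the
# result does not depend on whatever kind the session happened to be using.
first_draw_mt <- function(seed) {
  set.seed(seed, kind = "Mersenne-Twister")
  runif(1, 17, 26)
}

a <- first_draw_mt(86548915L)
b <- first_draw_mt(86548915L)

# After the call, the first element of RNGkind() names the pinned generator.
kind_after <- RNGkind()[1]
print(kind_after)
print(identical(a, b))
```

Note that `set.seed(..., kind = ...)` changes the generator for the rest of
the session, which may or may not be desirable inside a larger API.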


>
>     > We set the seed using (8 digit) unique IDs generated by an
>     > upstream system, before calling `runif`:
>
>     >     seeds = c( "86548915", "86551615", "86566163",
>     > "86577411", "86584144", "86584272", "86620568",
>     > "86724613", "86756002", "86768593", "86772411",
>     > "86781516", "86794389", "86805854", "86814600",
>     > "86835092", "86874179", "86876466", "86901193",
>     > "86987847", "86988080")
>
>     >  random_values = sapply(seeds, function(x) {
>     >   set.seed(x)
>     >   y = runif(1, 17, 26)
>     >   return(y)
>     > })
>
> Why do you do that?
>
> 1) You should set the seed *once*, not multiple times in one simulation.
>

The code is written this way because the seed is set every time the
function (API) is called, for call-level replicability. That doesn't make
much sense in an MRE, but it is a critical component of the larger
function. We do acknowledge that, for any single seed in the vector
`seeds`, a vector of draws appears to follow the uniform distribution.
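
The single-seed observation can be checked with a sketch along these lines.
The sample size of 10000 and the choice of quartiles are arbitrary
assumptions, and the variable names are illustrative.

```r
# Seed once with one of the reported seeds, then take many draws.
set.seed(86548915L)
draws <- runif(10000, 17, 26)

# Compare empirical quartiles with the theoretical ones for U(17, 26).
probs <- c(0.25, 0.5, 0.75)
empirical <- quantile(draws, probs, names = FALSE)
theoretical <- qunif(probs, 17, 26)

print(rbind(empirical, theoretical))
```

This only examines the stream that follows a single seed. It says nothing
about how the first draw varies across nearby seeds, which is the behaviour
in question here.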


> 2) Assuming that your strings are correctly translated to integers
>    and the same on all platforms, independent of locales (!) etc,
>    you are again not following the simple instruction on the help page:
>
>      ‘set.seed’ uses a single integer argument to set as many seeds as
>      are required.  It is intended as a simple way to get quite
>      different seeds by specifying small integer arguments, and also as
>      .....
>      .....
>
> Note:   ** small ** integer
> Why do you assume   86901193  to be a small integer ?
>

Because 86901193/2^32 = 0.02. What is a "small integer"?
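
The arithmetic, written out as a small sketch. The comparison against
`.Machine$integer.max` is an added point of reference and is not part of
the original argument.

```r
seed <- 86901193

# Fraction of the full 32-bit range, as quoted above (about 0.02).
frac_32bit <- seed / 2^32

# Fraction of the largest value an R integer can hold.
frac_intmax <- seed / .Machine$integer.max

# The seed fits in an R integer, so coercing it loses nothing.
fits <- seed <= .Machine$integer.max

print(c(frac_32bit = frac_32bit, frac_intmax = frac_intmax))
print(fits)
```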


>
>     > This gives values that are **extremely** bunched together.
>
>     >> summary(random_values)
>     >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>     >   25.13   25.36   25.66   25.58   25.83   25.94
>
>     > This behaviour of `runif` goes away when we use `kind =
>     > "Knuth-TAOCP-2002"`, and we get values that appear to be
>     > much more evenly spread out.
>
>     >     random_values = sapply(seeds, function(x) {
>     >       set.seed(x, kind = "Knuth-TAOCP-2002")
>     >       y = runif(1, 17, 26)
>     >       return(y)
>     >     })
>
>     > *Output omitted.*
>
>     > ---
>
>     > **The most interesting thing here is that this does not
>     > happen on Windows -- only happens on Ubuntu**
>     > (`sessionInfo` output for Ubuntu & Windows below).
>
>     > # Windows output: #
>
>     >> seeds = c(
>     >     + "86548915", "86551615", "86566163", "86577411", "86584144",
>     >     + "86584272", "86620568", "86724613", "86756002", "86768593", "86772411",
>     >     + "86781516", "86794389", "86805854", "86814600", "86835092", "86874179",
>     >     + "86876466", "86901193", "86987847", "86988080")
>     >>
>     >> random_values = sapply(seeds, function(x) {
>     >     + set.seed(x)
>     >     + y = runif(1, 17, 26)
>     >     + return(y)
>     >     + })
>     >>
>     >> summary(random_values)
>     >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>     >   17.32   20.14   23.00   22.17   24.07   25.90
>
>     > Can someone help understand what is going on?
>
>     > Ubuntu
>     > ------
>
>     > R version 3.4.0 (2017-04-21)
>     > Platform: x86_64-pc-linux-gnu (64-bit)
>     > Running under: Ubuntu 16.04.2 LTS
>
> You have not learned to get a current version of R.
> ===> You should not write to R-devel (sorry if this may sound harsh ..)
>

We tend to stay on a given version of R for a while, since upgrading our
production systems is not something we can do frequently, and this version
is only 6 months old. However, to address your concern: upgrading to
R 3.4.2 leaves the output unchanged.


> - - - - -
> After doing all this, your problem may still be just
> because you are using much too large integers for the 'seed'
> argument of set.seed()
>

Note that multiplying the reported seeds by 10 produces the expected
output, so it is not clear whether there is a particular range of seeds
that trips up the Mersenne-Twister algorithm:

seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L, 86584272L,
  86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
  86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
  86901193L, 86987847L, 86988080L)*10

random_values = sapply(seeds, function(x) {
  set.seed(x)
  y = runif(1, 17, 26)
  return(y)
})

summary(random_values)



> I really really strongly believe you should have used R-help
> instead of R-devel.
>
> Best,
> Martin Maechler
>

If, given the inputs in this reply, you still believe that this belongs on
R-help, we will switch over.

Your continued help would be appreciated in understanding the issue.

T
