[Rd] Bias in R's random integers?

Radford Neal r@dford @ending from c@@toronto@edu
Fri Sep 21 16:15:27 CEST 2018


> Duncan Murdoch:
>
> and you can see it in the original m with
> 
>    x <- sample(m, 1000000, replace = TRUE)
>    plot(density(x[x %% 2 == 0]))

OK.  Thanks.  I see there is a real problem.

One option to fix it while mostly retaining backwards-compatibility
would be to add extra bits from a second RNG call only when m is large
- e.g., larger than 2^27.  That would retain reproducibility for most
analyses of small to moderate size data sets.  Of course, there would
still be some small, detectable error for values a bit less than 2^27,
but perhaps that's tolerable.  (The 2^27 threshold could obviously be
debated.)
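
A minimal sketch of that idea, for illustration only.  It assumes the
legacy scheme maps a single uniform draw u to the index floor(m*u) + 1;
the name sample_index, the 2^25 split and the final min() guard are
hypothetical choices for this sketch, not R's actual code:

```r
# Hypothetical helper, not R's implementation.
# Draws one index in 1..m.  For m <= threshold it consumes exactly one
# uniform draw, so results match the single-draw scheme for a given seed.
sample_index <- function(m, threshold = 2^27) {
  u <- runif(1)
  if (m > threshold) {
    # Keep the top 25 bits of the first draw and fill in the low-order
    # part with a second draw (assumed split point: 2^25).
    k <- 2^25
    u <- (floor(k * u) + runif(1)) / k
  }
  # min() guards against u rounding up to 1 in floating point.
  min(m, floor(m * u) + 1)
}

set.seed(1)
sample_index(10)      # small m: one RNG call, as before
sample_index(2^30)    # large m: two RNG calls
```

With this arrangement the RNG stream, and hence any seeded result,
changes only for m above the threshold.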

R Core made a similar decision in the case of sampling without
replacement when implementing a new hashing algorithm that produced
different results.  It is enabled by default only when m > 1e7 and no
more than half the values are to be sampled, as was noted:

> Note that it wouldn't be the first time that sample() changes behavior
> in a non-backward compatible way:
>
>   https://stat.ethz.ch/pipermail/r-devel/2012-October/065049.html
>
> Cheers,
> H.
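
For concreteness, the default condition described above can be written
as a small predicate.  This is a sketch based on the description in
this message; the helper name use_hash_default is hypothetical, and the
exact default expression used by sample.int() should be checked against
its help page:

```r
# Hypothetical helper mirroring the condition described above:
# hashing is the default only for unweighted sampling without
# replacement, with n > 1e7 and at most half the values drawn.
use_hash_default <- function(n, size, replace = FALSE, prob = NULL) {
  !replace && is.null(prob) && size <= n / 2 && n > 1e7
}

use_hash_default(2e7, 10)      # large n, few values drawn
use_hash_default(1e6, 10)      # n below the 1e7 threshold
use_hash_default(2e7, 1.5e7)   # more than half the values drawn
```

Since the two algorithms give different results, the same seed can
yield different samples on either side of this condition.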

That incompatibility could have been avoided.  A year ago I posted
a fast hashing algorithm that produces the same results as the simple
algorithm, here:

  https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html

The latest version of this will be in the soon-to-be-released new
version of pqR, and will of course be enabled automatically whenever it
seems desirable, for a considerable speed gain in many cases.

  Radford Neal


