[Rd] Bias in R's random integers?
Duncan Murdoch
murdoch@dunc@n @ending from gm@il@com
Wed Sep 19 18:05:33 CEST 2018
On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> El mié., 19 sept. 2018 a las 14:43, Duncan Murdoch
> (<murdoch.duncan using gmail.com>) escribió:
>>
>> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>>> Dear list,
>>>
>>> It looks to me that R samples random integers using an intuitive but biased
>>> algorithm by going from a random number on [0,1) from the PRNG to a random
>>> integer, e.g.
>>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>>>
>>> Many other languages use various rejection sampling approaches which
>>> provide an unbiased method for sampling, such as in Go, python, and others
>>> described here: https://arxiv.org/abs/1805.10941 (I believe the biased
>>> algorithm currently used in R is also described there). I'm not an expert
>>> in this area, but does it make sense for the R to adopt one of the unbiased
>>> random sample algorithms outlined there and used in other languages? Would
>>> a patch providing such an algorithm be welcome? What concerns would need to
>>> be addressed first?
>>>
>>> I believe this issue was also raised by Killie & Philip in
>>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more
>>> recently in
>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
>>> pointing to the python implementation for comparison:
>>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>>
>> I think the analyses are correct, but I doubt if a change to the default
>> is likely to be accepted as it would make it more difficult to reproduce
>> older results.
>>
>> On the other hand, a contribution of a new function like sample() but
>> not suffering from the bias would be good. The normal way to make such
>> a contribution is in a user contributed package.
>>
>> By the way, R code illustrating the bias is probably not very hard to
>> put together. I believe the bias manifests itself in sample() producing
>> values with two different probabilities (instead of all equal
>> probabilities). Those may differ by as much as one part in 2^32. It's
>
> According to Kellie and Philip, in the attachment of the thread
> referenced by Carl, "The maximum ratio of selection probabilities can
> get as large as 1.5 if n is just below 2^31".
Sorry, I didn't write very well. I meant to say that the difference in
probabilities would be 2^-32, not that the ratio of probabilities would
be 1 + 2^-32.
By the way, I don't see the statement giving the ratio as 1.5, but maybe
I was looking in the wrong place. In Theorem 1 of the paper I was
looking in the ratio was "1 + m 2^{-w + 1}". In that formula m is your
n. If it is near 2^31, R uses w = 57 random bits, so the ratio would be
very, very small (one part in 2^25).
The worst case for R would happen when m is just below 2^25, where w
is at least 31 for the default generators. In that case the ratio could
be about 1.03.
Duncan Murdoch
More information about the R-devel
mailing list