[Rd] Bias in R's random integers?

Duncan Murdoch murdoch.duncan at gmail.com
Wed Sep 19 18:20:38 CEST 2018


On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> The 53 bits only encode at most 2^{32} possible values, because the 
> source of the float is the output of a 32-bit PRNG (the obsolete version 
> of MT). 53 bits isn't the relevant number here.

No, two calls to unif_rand() are used.  There are two 32-bit values, but
some of the bits are thrown away.
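
For illustration, one way two [0,1) draws can be combined into a single
finer-grained value is sketched below at the R level; the 2^25 split is an
assumption, chosen only so the result carries roughly the 57 bits mentioned
later in the thread, and is not a claim about the exact C code in RNG.c:

    ## First draw supplies the high-order bits, second fills in the rest.
    U <- 2^25                                    # assumed split point
    u <- (floor(U * runif(1)) + runif(1)) / U    # finer-grained value in [0,1)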

Duncan Murdoch

> 
> The selection ratios can get close to 2. Computer scientists don't do it 
> the way R does, for a reason.
> 
> Regards,
> Philip
> 
> On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
> 
>     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>      > On Wed, Sep 19, 2018 at 14:43, Duncan Murdoch
>      > (<murdoch.duncan at gmail.com>) wrote:
>      >>
>      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>      >>> Dear list,
>      >>>
>      >>> It looks to me that R samples random integers using an intuitive but
>      >>> biased algorithm by going from a random number on [0,1) from the PRNG
>      >>> to a random integer, e.g.
>      >>>
>      >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
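
For readers following along, the R-level analogue of that conversion is
roughly the following (a sketch only; the code at the line linked above is
C, so this is illustrative rather than a copy of it):

    n <- 10
    floor(runif(1) * n) + 1   # map one [0,1) draw to an integer in 1..n

The bias arises because the finitely many values the PRNG can return do not
split evenly across the n bins.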
>      >>>
>      >>> Many other languages use various rejection sampling approaches which
>      >>> provide an unbiased method for sampling, such as in Go, Python, and
>      >>> others described here: https://arxiv.org/abs/1805.10941 (I believe the
>      >>> biased algorithm currently used in R is also described there).  I'm not
>      >>> an expert in this area, but does it make sense for R to adopt one of
>      >>> the unbiased random sampling algorithms outlined there and used in other
>      >>> languages?  Would a patch providing such an algorithm be welcome?  What
>      >>> concerns would need to be addressed first?
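
To make the rejection idea concrete, here is a minimal, self-contained
sketch in R (an illustration of the general technique, not the algorithm of
Go, Python, or the paper):

    ## Draw one integer uniformly from 1..n by rejection from the next
    ## power of two; rbinom() stands in for a source of fair random bits.
    sample1_unbiased <- function(n) {
      bits <- ceiling(log2(n))                # bits needed to cover 0..n-1
      repeat {
        b <- rbinom(bits, 1, 0.5)             # 'bits' independent fair bits
        v <- sum(b * 2^(seq_len(bits) - 1))   # assemble them into an integer
        if (v < n) return(v + 1)              # accept; otherwise redraw
      }
    }
    table(replicate(1e4, sample1_unbiased(6)))  # roughly uniform on 1..6

Every accepted value is hit by exactly one bit pattern, so no integer is
favoured; the price is an occasional redraw (on average fewer than two draws
per value).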
>      >>>
>      >>> I believe this issue was also raised by Kellie & Philip in
>      >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more
>      >>> recently in
>      >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
>      >>> pointing to the Python implementation for comparison:
>      >>>
>      >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>      >>
>      >> I think the analyses are correct, but I doubt if a change to the default
>      >> is likely to be accepted as it would make it more difficult to reproduce
>      >> older results.
>      >>
>      >> On the other hand, a contribution of a new function like sample() but
>      >> not suffering from the bias would be good.  The normal way to make such
>      >> a contribution is in a user contributed package.
>      >>
>      >> By the way, R code illustrating the bias is probably not very hard to
>      >> put together.  I believe the bias manifests itself in sample() producing
>      >> values with two different probabilities (instead of all equal
>      >> probabilities).  Those may differ by as much as one part in 2^32.  It's
>      >
>      > According to Kellie and Philip, in the attachment of the thread
>      > referenced by Carl, "The maximum ratio of selection probabilities can
>      > get as large as 1.5 if n is just below 2^31".
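
Both statements are easy to check by brute force.  The following
self-contained R sketch counts how many of the 2^w equally likely grid
points u = k/2^w land on each integer under the floor(n * u) map, using a
small w so the enumeration is quick (R's own generators have w of at least
31):

    w <- 16
    k <- 0:(2^w - 1)

    n <- 5
    table(table(floor(n * k / 2^w)))   # two counts occur: 13107 (x4), 13108 (x1)

    n <- 2^15 - 1                      # n just below 2^(w-1)
    range(table(floor(n * k / 2^w)))   # counts 2 and 3: a ratio of 1.5

With n just below half the grid size the two counts are 2 and 3, which is
the scaled-down analogue of the 1.5 figure quoted above.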
> 
>     Sorry, I didn't write very well.  I meant to say that the difference in
>     probabilities would be 2^-32, not that the ratio of probabilities would
>     be 1 + 2^-32.
> 
>     By the way, I don't see the statement giving the ratio as 1.5, but maybe
>     I was looking in the wrong place.  In Theorem 1 of the paper I was looking
>     at, the ratio was "1 + m 2^{-w + 1}".  In that formula m is your n.  If it
>     is near 2^31, R uses w = 57 random bits, so the ratio would be very close
>     to 1 (off by about one part in 2^25).
> 
>     The worst case for R would happen when m is just below 2^25, where w is
>     at least 31 for the default generators.  In that case the ratio could be
>     about 1.03.
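
For the record, plugging those numbers into the quoted bound in R gives the
magnitudes mentioned above:

    ## ratio bound 1 + m * 2^(-w + 1), as quoted from Theorem 1
    ratio <- function(m, w) 1 + m * 2^(-w + 1)
    ratio(2^31, 57)   # ~1.00000003, off from 1 by about one part in 2^25
    ratio(2^25, 31)   # 1.03125, the ~1.03 worst case for the default generators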
> 
>     Duncan Murdoch
> 
> 
> 
> -- 
> Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | 
> Professor,  Department of Statistics |
> University of California
> Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> @philipbstark
>



More information about the R-devel mailing list