[Rd] Bias in R's random integers?

Philip B. Stark @t@rk @ending from @t@t@berkeley@edu
Wed Sep 19 21:58:12 CEST 2018


One more thing, apropos this:

I'm still not convinced that there has ever been a simulation run with
detectable bias compared to Monte Carlo error unless it (like this one) was
designed specifically to show the problem.


I often use random permutations to simulate p-values to calibrate
permutation tests. If I'm trying to simulate the probability of a
low-probability event, this could matter a lot.

Best wishes,
Philip

On Wed, Sep 19, 2018 at 12:52 PM Philip B. Stark <stark using stat.berkeley.edu>
wrote:

> Hi Duncan--
>
> Nice simulation!
>
> The absolute difference in probabilities is small, but the maximum
> relative difference grows from something negligible to almost 2 as m
> approaches 2**31.
>
> Because the L_1 distance between the uniform distribution on {1, ..., m}
> and what you actually get is large, there have to be test functions whose
> expectations are quite different under the two distributions. Whether those
> correspond to commonly used statistics or not, I have no idea.
>
> Regarding backwards compatibility: as a user, I'd rather the default
> sample() do the best possible thing, and take an extra step to use
> something like sample(..., legacy=TRUE) if I want to reproduce old results.
>
> Regards,
> Philip
>
> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan using gmail.com>
> wrote:
>
>> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>> > No, the 2nd call only happens when m > 2**31. Here's the code:
>>
>> Yes, you're right. Sorry!
>>
>> So the ratio really does come close to 2.  However, the difference in
>> probabilities between outcomes is still at most 2^-32 when m is less
>> than that cutoff.  That's not feasible to detect; the only detectable
>> difference would happen if some event was constructed to hold an
>> abundance of outcomes with especially low (or especially high)
>> probability.
>>
>> As I said in my original post, it's probably not hard to construct such
>> a thing, but as I've said more recently, it probably wouldn't happen by
>> chance.  Here's one attempt to do it:
>>
>> Call the values from unif_rand() "the unif_rand() outcomes".  Call the
>> values from sample() the sample outcomes.
>>
>> It would be easiest to see the error if half of the sample() outcomes
>> used two unif_rand() outcomes, and half used just one.  That would mean
>> m should be (2/3) * 2^32, but that's too big and would trigger the other
>> version.
>>
>> So how about half use 2 unif_rands(), and half use 3?  That means m =
>> (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes would
>> alternate between the two possibilities, so our event could be even
>> versus odd outcomes.
>>
>> Let's try it:
>>
>>  > m <- (2/5)*2^32
>>  > m > 2^31
>> [1] FALSE
>>  > x <- sample(m, 1000000, replace = TRUE)
>>  > table(x %% 2)
>>
>>       0      1
>> 399850 600150
>>
>> Since m is an even number, the true proportions of evens and odds should
>> be exactly 0.5.  That's some pretty strong evidence of the bug in the
>> generator.  (Note that the ratio of the observed probabilities is about
>> 1.5, so I may not be the first person to have done this.)
>>
>> I'm still not convinced that there has ever been a simulation run with
>> detectable bias compared to Monte Carlo error unless it (like this one)
>> was designed specifically to show the problem.
>>
>> Duncan Murdoch
>>
>> >
>> > (RNG.c, lines 793ff)
>> >
>> > double R_unif_index(double dn)
>> > {
>> >      double cut = INT_MAX;
>> >
>> >      switch(RNG_kind) {
>> >      case KNUTH_TAOCP:
>> >      case USER_UNIF:
>> >      case KNUTH_TAOCP2:
>> > cut = 33554431.0; /* 2^25 - 1 */
>> > break;
>> >      default:
>> > break;
>> >     }
>> >
>> >      double u = dn > cut ? ru() : unif_rand();
>> >      return floor(dn * u);
>> > }
>> >
>> > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <
>> murdoch.duncan using gmail.com
>> > <mailto:murdoch.duncan using gmail.com>> wrote:
>> >
>> >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>> >      > The 53 bits only encode at most 2^{32} possible values, because
>> the
>> >      > source of the float is the output of a 32-bit PRNG (the obsolete
>> >     version
>> >      > of MT). 53 bits isn't the relevant number here.
>> >
>> >     No, two calls to unif_rand() are used.  There are two 32 bit values,
>> >     but
>> >     some of the bits are thrown away.
>> >
>> >     Duncan Murdoch
>> >
>> >      >
>> >      > The selection ratios can get close to 2. Computer scientists
>> >     don't do it
>> >      > the way R does, for a reason.
>> >      >
>> >      > Regards,
>> >      > Philip
>> >      >
>> >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>> >     <murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>
>> >      > <mailto:murdoch.duncan using gmail.com
>> >     <mailto:murdoch.duncan using gmail.com>>> wrote:
>> >      >
>> >      >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>> >      >      > El mié., 19 sept. 2018 a las 14:43, Duncan Murdoch
>> >      >      > (<murdoch.duncan using gmail.com
>> >     <mailto:murdoch.duncan using gmail.com> <mailto:murdoch.duncan using gmail.com
>> >     <mailto:murdoch.duncan using gmail.com>>>)
>> >      >     escribió:
>> >      >      >>
>> >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>> >      >      >>> Dear list,
>> >      >      >>>
>> >      >      >>> It looks to me that R samples random integers using an
>> >      >     intuitive but biased
>> >      >      >>> algorithm by going from a random number on [0,1) from
>> >     the PRNG
>> >      >     to a random
>> >      >      >>> integer, e.g.
>> >      >      >>>
>> >      >
>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>> >      >      >>>
>> >      >      >>> Many other languages use various rejection sampling
>> >     approaches
>> >      >     which
>> >      >      >>> provide an unbiased method for sampling, such as in Go,
>> >     python,
>> >      >     and others
>> >      >      >>> described here: https://arxiv.org/abs/1805.10941 (I
>> >     believe the
>> >      >     biased
>> >      >      >>> algorithm currently used in R is also described
>> there).  I'm
>> >      >     not an expert
>> >      >      >>> in this area, but does it make sense for the R to adopt
>> >     one of
>> >      >     the unbiased
>> >      >      >>> random sample algorithms outlined there and used in
>> other
>> >      >     languages?  Would
>> >      >      >>> a patch providing such an algorithm be welcome? What
>> >     concerns
>> >      >     would need to
>> >      >      >>> be addressed first?
>> >      >      >>>
>> >      >      >>> I believe this issue was also raised by Killie & Philip
>> in
>> >      >      >>>
>> >     http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
>> >      >     more
>> >      >      >>> recently in
>> >      >      >>>
>> >      >
>> >     https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
>> >     <
>> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>> >      >
>> >       <
>> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
>> >      >      >>> pointing to the python implementation for comparison:
>> >      >      >>>
>> >      >
>> >
>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>> >      >      >>
>> >      >      >> I think the analyses are correct, but I doubt if a change
>> >     to the
>> >      >     default
>> >      >      >> is likely to be accepted as it would make it more
>> >     difficult to
>> >      >     reproduce
>> >      >      >> older results.
>> >      >      >>
>> >      >      >> On the other hand, a contribution of a new function like
>> >      >     sample() but
>> >      >      >> not suffering from the bias would be good.  The normal
>> way to
>> >      >     make such
>> >      >      >> a contribution is in a user contributed package.
>> >      >      >>
>> >      >      >> By the way, R code illustrating the bias is probably not
>> very
>> >      >     hard to
>> >      >      >> put together.  I believe the bias manifests itself in
>> >     sample()
>> >      >     producing
>> >      >      >> values with two different probabilities (instead of all
>> equal
>> >      >      >> probabilities).  Those may differ by as much as one part
>> in
>> >      >     2^32.  It's
>> >      >      >
>> >      >      > According to Kellie and Philip, in the attachment of the
>> >     thread
>> >      >      > referenced by Carl, "The maximum ratio of selection
>> >     probabilities can
>> >      >      > get as large as 1.5 if n is just below 2^31".
>> >      >
>> >      >     Sorry, I didn't write very well.  I meant to say that the
>> >     difference in
>> >      >     probabilities would be 2^-32, not that the ratio of
>> >     probabilities would
>> >      >     be 1 + 2^-32.
>> >      >
>> >      >     By the way, I don't see the statement giving the ratio as
>> >     1.5, but
>> >      >     maybe
>> >      >     I was looking in the wrong place.  In Theorem 1 of the paper
>> >     I was
>> >      >     looking in the ratio was "1 + m 2^{-w + 1}".  In that formula
>> >     m is your
>> >      >     n.  If it is near 2^31, R uses w = 57 random bits, so the
>> ratio
>> >      >     would be
>> >      >     very, very small (one part in 2^25).
>> >      >
>> >      >     The worst case for R would happen when m  is just below
>> >     2^25, where w
>> >      >     is at least 31 for the default generators.  In that case the
>> >     ratio
>> >      >     could
>> >      >     be about 1.03.
>> >      >
>> >      >     Duncan Murdoch
>> >      >
>> >      >
>> >      >
>> >      > --
>> >      > Philip B. Stark | Associate Dean, Mathematical and Physical
>> >     Sciences |
>> >      > Professor,  Department of Statistics |
>> >      > University of California
>> >      > Berkeley, CA 94720-3860 | 510-394-5077 |
>> >     statistics.berkeley.edu/~stark
>> >     <http://statistics.berkeley.edu/%7Estark>
>> >      > <http://statistics.berkeley.edu/%7Estark> |
>> >      > @philipbstark
>> >      >
>> >
>> >
>> >
>> > --
>> > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
>> > Professor,  Department of Statistics |
>> > University of California
>> > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark
>> > <http://statistics.berkeley.edu/%7Estark> |
>> > @philipbstark
>> >
>>
>>
>
> --
> Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> Professor,  Department of Statistics |
> University of California
> Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> @philipbstark
>
>

-- 
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark

	[[alternative HTML version deleted]]



More information about the R-devel mailing list