[Rd] Bias in R's random integers?
Philip B. Stark
stark at stat.berkeley.edu
Fri Sep 21 04:58:07 CEST 2018
The same issue occurs in walker_ProbSampleReplace() in random.c, lines
386-387:
rU = unif_rand() * n;
k = (int) rU;
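
For comparison, here is a rough R sketch of the rejection-style approach
discussed downthread, just to illustrate the idea. rand_bits32() is only a
stand-in built on runif(); a real fix would take raw bits from the underlying
generator:

rand_bits32 <- function() floor(runif(1) * 2^32)   # stand-in 32-bit source

unbiased_index <- function(n) {
    stopifnot(n >= 1, n <= 2^31)
    limit <- 2^32 - (2^32 %% n)   # largest multiple of n below 2^32
    repeat {
        u <- rand_bits32()
        if (u < limit)            # accept: each residue class equally likely
            return(u %% n + 1)
    }
}

table(replicate(10000, unbiased_index(6)))   # roughly uniform on 1..6

Unlike truncating unif_rand() * n, this gives every value in 1..n exactly the
same probability (given a uniform 32-bit source), at the cost of an occasional
extra draw.
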
Cheers,
Philip
On Wed, Sep 19, 2018 at 3:08 PM Duncan Murdoch <murdoch.duncan using gmail.com>
wrote:
> On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
> >
> > It doesn't seem too hard to come up with plausible ways in which this
> > could give bad results. Suppose I sample rows from a large dataset,
> > maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g.
> > odd rows are males, even rows are females. Oops! Very non-representative
> > sample, bootstrap p values are garbage.
>
> That would only happen if your dataset was exactly 1717986918 elements
> in size. (And in fact, it will be less extreme than I posted: I had x
> set to 1717986918.4, as described in another thread. If you use an
> integer value you need a different pattern; add or subtract an element
> or two and the pattern needed to see a problem changes drastically.)
>
> But if you're sampling from a dataset of that exact size, then you
> should worry about this bug. Don't use sample(). Use the algorithm that
> Carl described.
>
> Duncan Murdoch
>
> >
> > David
> >
> > On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch
> > <murdoch.duncan using gmail.com> wrote:
> >
> > On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> > > Hi Duncan--
> > >
> > >
> >
> > That is a mathematically true statement, but I suspect it is not very
> > relevant. Pseudo-random number generators always have test functions
> > whose sample averages are quite different from the expectation under the
> > true distribution. Remember Von Neumann's "state of sin" quote. The
> > bug in sample() just means it is easier to find such a function than it
> > would otherwise be.
> >
> > The practical question is whether such a function is likely to arise in
> > practice or not.
> >
> > > Whether those correspond to commonly used statistics or not, I have no
> > > idea.
> >
> > I am pretty confident that this bug rarely matters.
> >
> > > Regarding backwards compatibility: as a user, I'd rather the default
> > > sample() do the best possible thing, and take an extra step to use
> > > something like sample(..., legacy=TRUE) if I want to reproduce old
> > > results.
> >
> > I suspect there's a good chance the bug I discovered today (non-integer
> > x values not being truncated) will be declared to be a feature, and the
> > documentation will be changed. Then the rejection sampling approach
> > would need to be quite a bit more complicated.
> >
> > I think a documentation warning about the accuracy of sampling
> > probabilities would also be a sufficient fix here, and would be quite a
> > bit less trouble than changing the default sample(). But as I said in
> > my original post, a contribution of a function without this bug would
> > be a nice addition.
> >
> > Duncan Murdoch
> >
> > >
> > > Regards,
> > > Philip
> > >
> > > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
> > > <murdoch.duncan using gmail.com> wrote:
> > >
> > > On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> > > > No, the 2nd call only happens when m > 2^31. Here's the code:
> > >
> > > Yes, you're right. Sorry!
> > >
> > > So the ratio really does come close to 2. However, the difference in
> > > probabilities between outcomes is still at most 2^-32 when m is less
> > > than that cutoff. That's not feasible to detect; the only detectable
> > > difference would happen if some event was constructed to hold an
> > > abundance of outcomes with especially low (or especially high)
> > > probability.
> > >
> > > As I said in my original post, it's probably not hard to construct such
> > > a thing, but as I've said more recently, it probably wouldn't happen by
> > > chance. Here's one attempt to do it:
> > >
> > > Call the values from unif_rand() "the unif_rand() outcomes". Call the
> > > values from sample() the sample outcomes.
> > >
> > > It would be easiest to see the error if half of the sample() outcomes
> > > used two unif_rand() outcomes, and half used just one. That would mean
> > > m should be (2/3) * 2^32, but that's too big and would trigger the
> > > other version.
> > >
> > > So how about half use 2 unif_rands(), and half use 3? That means
> > > m = (2/5) * 2^32 = 1717986918. A good guess is that sample() outcomes
> > > would alternate between the two possibilities, so our event could be
> > > even versus odd outcomes.
> > >
> > > Let's try it:
> > >
> > > > m <- (2/5)*2^32
> > > > m > 2^31
> > > [1] FALSE
> > > > x <- sample(m, 1000000, replace = TRUE)
> > > > table(x %% 2)
> > >
> > >      0      1
> > > 399850 600150
> > >
> > > Since m is an even number, the true proportions of evens and odds
> > > should be exactly 0.5. That's some pretty strong evidence of the bug
> > > in the generator. (Note that the ratio of the observed probabilities
> > > is about 1.5, so I may not be the first person to have done this.)
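> > >
> > > Spelling out the arithmetic behind that split (assuming the generator
> > > has 2^32 equally likely outputs):
> > >
> > > > m <- (2/5) * 2^32
> > > > 2^32 / m           # 2.5, so outcomes get either 2 or 3 generator values
> > > [1] 2.5
> > > > c(2, 3) / 2^32     # the two per-outcome probabilities; their ratio is 1.5
> > > [1] 4.656613e-10 6.984919e-10
> > > > c(2, 3) / (2 + 3)  # expected proportions for the two classes
> > > [1] 0.4 0.6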
> > >
> > > I'm still not convinced that there has ever been a simulation run with
> > > detectable bias compared to Monte Carlo error unless it (like this one)
> > > was designed specifically to show the problem.
> > >
> > > Duncan Murdoch
> > >
> > > >
> > > > (RNG.c, lines 793ff)
> > > >
> > > > double R_unif_index(double dn)
> > > > {
> > > >     double cut = INT_MAX;
> > > >
> > > >     switch(RNG_kind) {
> > > >     case KNUTH_TAOCP:
> > > >     case USER_UNIF:
> > > >     case KNUTH_TAOCP2:
> > > >         cut = 33554431.0; /* 2^25 - 1 */
> > > >         break;
> > > >     default:
> > > >         break;
> > > >     }
> > > >
> > > >     double u = dn > cut ? ru() : unif_rand();
> > > >     return floor(dn * u);
> > > > }
> > > >
> > > > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> > > > <murdoch.duncan using gmail.com> wrote:
> > > >
> > > > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > > > > The 53 bits only encode at most 2^{32} possible values, because the
> > > > > source of the float is the output of a 32-bit PRNG (the obsolete
> > > > > version of MT). 53 bits isn't the relevant number here.
> > > >
> > > > No, two calls to unif_rand() are used. There are two 32-bit values,
> > > > but some of the bits are thrown away.
> > > >
> > > > Duncan Murdoch
> > > >
> > > > >
> > > > > The selection ratios can get close to 2. Computer scientists don't
> > > > > do it the way R does, for a reason.
> > > > >
> > > > > Regards,
> > > > > Philip
> > > > >
> > > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> > > > > <murdoch.duncan using gmail.com> wrote:
> > > > >
> > > > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > > > > > On Wed, Sep 19, 2018 at 14:43, Duncan Murdoch
> > > > > > (<murdoch.duncan using gmail.com>) wrote:
> > > > > >>
> > > > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > > > > >>> Dear list,
> > > > > >>>
> > > > > >>> It looks to me that R samples random integers using an
> > > > > >>> intuitive but biased algorithm by going from a random number
> > > > > >>> on [0,1) from the PRNG to a random integer, e.g.
> > > > > >>>
> > > > > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> > > > > >>>
> > > > > >>> Many other languages use various rejection sampling approaches
> > > > > >>> which provide an unbiased method for sampling, such as in Go,
> > > > > >>> Python, and others described here: https://arxiv.org/abs/1805.10941
> > > > > >>> (I believe the biased algorithm currently used in R is also
> > > > > >>> described there). I'm not an expert in this area, but does it
> > > > > >>> make sense for R to adopt one of the unbiased random sampling
> > > > > >>> algorithms outlined there and used in other languages? Would a
> > > > > >>> patch providing such an algorithm be welcome? What concerns
> > > > > >>> would need to be addressed first?
> > > > > >>>
> > > > > >>> I believe this issue was also raised by Kellie & Philip in
> > > > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> > > > > >>> more recently in
> > > > > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> > > > > >>> pointing to the Python implementation for comparison:
> > > > > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> > > > > >>
> > > > > >> I think the analyses are correct, but I doubt if a change to the
> > > > > >> default is likely to be accepted as it would make it more
> > > > > >> difficult to reproduce older results.
> > > > > >>
> > > > > >> On the other hand, a contribution of a new function like sample()
> > > > > >> but not suffering from the bias would be good. The normal way to
> > > > > >> make such a contribution is in a user contributed package.
> > > > > >>
> > > > > >> By the way, R code illustrating the bias is probably not very
> > > > > >> hard to put together. I believe the bias manifests itself in
> > > > > >> sample() producing values with two different probabilities
> > > > > >> (instead of all equal probabilities). Those may differ by as
> > > > > >> much as one part in 2^32. It's
> > > > > >
> > > > > > According to Kellie and Philip, in the attachment of the thread
> > > > > > referenced by Carl, "The maximum ratio of selection probabilities
> > > > > > can get as large as 1.5 if n is just below 2^31".
> > > > >
> > > > > Sorry, I didn't write very well. I meant to say that the difference
> > > > > in probabilities would be 2^-32, not that the ratio of probabilities
> > > > > would be 1 + 2^-32.
> > > > >
> > > > > By the way, I don't see the statement giving the ratio as 1.5, but
> > > > > maybe I was looking in the wrong place. In Theorem 1 of the paper I
> > > > > was looking in, the ratio was "1 + m 2^{-w + 1}". In that formula m
> > > > > is your n. If it is near 2^31, R uses w = 57 random bits, so the
> > > > > ratio would exceed 1 by only a very, very small amount (one part in
> > > > > 2^25).
> > > > >
> > > > > The worst case for R would happen when m is just below 2^25, where
> > > > > w is at least 31 for the default generators. In that case the ratio
> > > > > could be about 1.03.
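> > > > >
> > > > > Plugging those two cases into that formula, just to spell out the
> > > > > arithmetic (w values as above):
> > > > >
> > > > > > ratio <- function(m, w) 1 + m * 2^(-w + 1)   # bound quoted above
> > > > > > ratio(2^31, 57) - 1     # m near 2^31, w = 57: excess over 1 is 2^-25
> > > > > [1] 2.980232e-08
> > > > > > ratio(2^25 - 1, 31)     # m just below 2^25, w = 31
> > > > > [1] 1.03125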
> > > > >
> > > > > Duncan Murdoch
> > > > >
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Sent from Gmail Mobile
>
>
--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor, Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark