[Rd] Bias in R's random integers?
Philip B. Stark
stark at stat.berkeley.edu
Fri Sep 21 04:58:07 CEST 2018
The same issue occurs in walker_ProbSampleReplace() in random.c, lines
386-387:
rU = unif_rand() * n;
k = (int) rU;
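
For comparison, here is a rough R sketch of the rejection-style approach
discussed downthread, just to illustrate the idea. rand_bits32() is only a
stand-in built on runif(); a real fix would take raw bits from the underlying
generator:

rand_bits32 <- function() floor(runif(1) * 2^32)   # stand-in 32-bit source

unbiased_index <- function(n) {
    stopifnot(n >= 1, n <= 2^31)
    limit <- 2^32 - (2^32 %% n)   # largest multiple of n below 2^32
    repeat {
        u <- rand_bits32()
        if (u < limit)            # accept: each residue class equally likely
            return(u %% n + 1)
    }
}

table(replicate(10000, unbiased_index(6)))   # roughly uniform on 1..6

Unlike truncating unif_rand() * n, this gives every value in 1..n exactly the
same probability (given a uniform 32-bit source), at the cost of an occasional
extra draw.
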
Cheers,
Philip
On Wed, Sep 19, 2018 at 3:08 PM Duncan Murdoch <murdoch.duncan using gmail.com>
wrote:
> On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
> >
> > It doesn't seem too hard to come up with plausible ways in which this
> > could give bad results. Suppose I sample rows from a large dataset,
> > maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g.
> > odd rows are males, even rows are females. Oops! Very non-representative
> > sample, bootstrap p values are garbage.
>
> That would only happen if your dataset was exactly 1717986918 elements
> in size. (And in fact, it will be less extreme than I posted: I had x
> set to 1717986918.4, as described in another thread. If you use an
> integer value you need a different pattern; add or subtract an element
> or two and the pattern needed to see a problem changes drastically.)
>
> But if you're sampling from a dataset of that exact size, then you
> should worry about this bug. Don't use sample(). Use the algorithm that
> Carl described.
>
> Duncan Murdoch
>
> >
> > David
> >
> > On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch
> > <murdoch.duncan using gmail.com> wrote:
> >
> > On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> > > Hi Duncan--
> > >
> > >
> >
> > That is a mathematically true statement, but I suspect it is not very
> > relevant. Pseudo-random number generators always have test functions
> > whose sample averages are quite different from the expectation under the
> > true distribution. Remember Von Neumann's "state of sin" quote. The
> > bug in sample() just means it is easier to find such a function than it
> > would otherwise be.
> >
> > The practical question is whether such a function is likely to arise in
> > practice or not.
> >
> > > Whether those correspond to commonly used statistics or not, I have no
> > > idea.
> >
> > I am pretty confident that this bug rarely matters.
> >
> > > Regarding backwards compatibility: as a user, I'd rather the default
> > > sample() do the best possible thing, and take an extra step to use
> > > something like sample(..., legacy=TRUE) if I want to reproduce old
> > > results.
> >
> > I suspect there's a good chance the bug I discovered today (non-integer
> > x values not being truncated) will be declared to be a feature, and the
> > documentation will be changed. Then the rejection sampling approach
> > would need to be quite a bit more complicated.
> >
> > I think a documentation warning about the accuracy of sampling
> > probabilities would also be a sufficient fix here, and would be quite a
> > bit less trouble than changing the default sample(). But as I said in
> > my original post, a contribution of a function without this bug would
> > be a nice addition.
> >
> > Duncan Murdoch
> >
> > >
> > > Regards,
> > > Philip
> > >
> > > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
> > > <murdoch.duncan using gmail.com> wrote:
> > >
> > > On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> > > > No, the 2nd call only happens when m > 2^31. Here's the code:
> > >
> > > Yes, you're right. Sorry!
> > >
> > > So the ratio really does come close to 2. However, the difference in
> > > probabilities between outcomes is still at most 2^-32 when m is less
> > > than that cutoff. That's not feasible to detect; the only detectable
> > > difference would happen if some event was constructed to hold an
> > > abundance of outcomes with especially low (or especially high)
> > > probability.
> > >
> > > As I said in my original post, it's probably not hard to construct such
> > > a thing, but as I've said more recently, it probably wouldn't happen by
> > > chance. Here's one attempt to do it:
> > >
> > > Call the values from unif_rand() "the unif_rand() outcomes". Call the
> > > values from sample() the sample outcomes.
> > >
> > > It would be easiest to see the error if half of the sample() outcomes
> > > used two unif_rand() outcomes, and half used just one. That would mean
> > > m should be (2/3) * 2^32, but that's too big and would trigger the
> > > other version.
> > >
> > > So how about half use 2 unif_rands(), and half use 3? That means
> > > m = (2/5) * 2^32 = 1717986918. A good guess is that sample() outcomes
> > > would alternate between the two possibilities, so our event could be
> > > even versus odd outcomes.
> > >
> > > Let's try it:
> > >
> > > > m <- (2/5)*2^32
> > > > m > 2^31
> > > [1] FALSE
> > > > x <- sample(m, 1000000, replace = TRUE)
> > > > table(x %% 2)
> > >
> > >      0      1
> > > 399850 600150
> > >
> > > Since m is an even number, the true proportions of evens and odds
> > > should be exactly 0.5. That's some pretty strong evidence of the bug
> > > in the generator. (Note that the ratio of the observed probabilities
> > > is about 1.5, so I may not be the first person to have done this.)
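> > >
> > > Spelling out the arithmetic behind that split (assuming the generator
> > > has 2^32 equally likely outputs):
> > >
> > > > m <- (2/5) * 2^32
> > > > 2^32 / m           # 2.5, so outcomes get either 2 or 3 generator values
> > > [1] 2.5
> > > > c(2, 3) / 2^32     # the two per-outcome probabilities; their ratio is 1.5
> > > [1] 4.656613e-10 6.984919e-10
> > > > c(2, 3) / (2 + 3)  # expected proportions for the two classes
> > > [1] 0.4 0.6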
> > >
> > > I'm still not convinced that there has ever been a simulation run with
> > > detectable bias compared to Monte Carlo error unless it (like this one)
> > > was designed specifically to show the problem.
> > >
> > > Duncan Murdoch
> > >
> > > >
> > > > (RNG.c, lines 793ff)
> > > >
> > > > double R_unif_index(double dn)
> > > > {
> > > >     double cut = INT_MAX;
> > > >
> > > >     switch(RNG_kind) {
> > > >     case KNUTH_TAOCP:
> > > >     case USER_UNIF:
> > > >     case KNUTH_TAOCP2:
> > > >         cut = 33554431.0; /* 2^25 - 1 */
> > > >         break;
> > > >     default:
> > > >         break;
> > > >     }
> > > >
> > > >     double u = dn > cut ? ru() : unif_rand();
> > > >     return floor(dn * u);
> > > > }
> > > >
> > > > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> > > > <murdoch.duncan using gmail.com> wrote:
> > > >
> > > > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > > > > The 53 bits only encode at most 2^{32} possible values, because the
> > > > > source of the float is the output of a 32-bit PRNG (the obsolete
> > > > > version of MT). 53 bits isn't the relevant number here.
> > > >
> > > > No, two calls to unif_rand() are used. There are two 32-bit values,
> > > > but some of the bits are thrown away.
> > > >
> > > > Duncan Murdoch
> > > >
> > > > >
> > > > > The selection ratios can get close to 2. Computer scientists don't
> > > > > do it the way R does, for a reason.
> > > > >
> > > > > Regards,
> > > > > Philip
> > > > >
> > > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> > > > > <murdoch.duncan using gmail.com> wrote:
> > > > >
> > > > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > > > > > On Wed, Sep 19, 2018 at 14:43, Duncan Murdoch
> > > > > > (<murdoch.duncan using gmail.com>) wrote:
> > > > > >>
> > > > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > > > > >>> Dear list,
> > > > > >>>
> > > > > >>> It looks to me that R samples random integers using an
> > > > > >>> intuitive but biased algorithm by going from a random number
> > > > > >>> on [0,1) from the PRNG to a random integer, e.g.
> > > > > >>>
> > > > > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> > > > > >>>
> > > > > >>> Many other languages use various rejection sampling approaches
> > > > > >>> which provide an unbiased method for sampling, such as in Go,
> > > > > >>> Python, and others described here: https://arxiv.org/abs/1805.10941
> > > > > >>> (I believe the biased algorithm currently used in R is also
> > > > > >>> described there). I'm not an expert in this area, but does it
> > > > > >>> make sense for R to adopt one of the unbiased random sampling
> > > > > >>> algorithms outlined there and used in other languages? Would a
> > > > > >>> patch providing such an algorithm be welcome? What concerns
> > > > > >>> would need to be addressed first?
> > > > > >>>
> > > > > >>> I believe this issue was also raised by Kellie & Philip in
> > > > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> > > > > >>> more recently in
> > > > > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> > > > > >>> pointing to the Python implementation for comparison:
> > > > > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> > > > > >>
> > > > > >> I think the analyses are correct, but I doubt if a change to the
> > > > > >> default is likely to be accepted as it would make it more
> > > > > >> difficult to reproduce older results.
> > > > > >>
> > > > > >> On the other hand, a contribution of a new function like sample()
> > > > > >> but not suffering from the bias would be good. The normal way to
> > > > > >> make such a contribution is in a user contributed package.
> > > > > >>
> > > > > >> By the way, R code illustrating the bias is probably not very
> > > > > >> hard to put together. I believe the bias manifests itself in
> > > > > >> sample() producing values with two different probabilities
> > > > > >> (instead of all equal probabilities). Those may differ by as
> > > > > >> much as one part in 2^32. It's
> > > > > >
> > > > > > According to Kellie and Philip, in the attachment of the thread
> > > > > > referenced by Carl, "The maximum ratio of selection probabilities
> > > > > > can get as large as 1.5 if n is just below 2^31".
> > > > >
> > > > > Sorry, I didn't write very well. I meant to say that the difference
> > > > > in probabilities would be 2^-32, not that the ratio of probabilities
> > > > > would be 1 + 2^-32.
> > > > >
> > > > > By the way, I don't see the statement giving the ratio as 1.5, but
> > > > > maybe I was looking in the wrong place. In Theorem 1 of the paper I
> > > > > was looking in, the ratio was "1 + m 2^{-w + 1}". In that formula m
> > > > > is your n. If it is near 2^31, R uses w = 57 random bits, so the
> > > > > ratio would exceed 1 by only a very, very small amount (one part in
> > > > > 2^25).
> > > > >
> > > > > The worst case for R would happen when m is just below 2^25, where
> > > > > w is at least 31 for the default generators. In that case the ratio
> > > > > could be about 1.03.
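> > > > >
> > > > > Plugging those two cases into that formula, just to spell out the
> > > > > arithmetic (w values as above):
> > > > >
> > > > > > ratio <- function(m, w) 1 + m * 2^(-w + 1)   # bound quoted above
> > > > > > ratio(2^31, 57) - 1     # m near 2^31, w = 57: excess over 1 is 2^-25
> > > > > [1] 2.980232e-08
> > > > > > ratio(2^25 - 1, 31)     # m just below 2^25, w = 31
> > > > > [1] 1.03125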
> > > > >
> > > > > Duncan Murdoch
> > > > >
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Sent from Gmail Mobile
>
>
--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor, Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark