[Rd] Bias in R's random integers?
Duncan Murdoch
murdoch@dunc@n @ending from gm@il@com
Thu Sep 20 00:08:28 CEST 2018
On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
>
> It doesn't seem too hard to come up with plausible ways in which this
> could give bad results. Suppose I sample rows from a large dataset,
> maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g.
> odd rows are males, even rows are females. Oops! Very non-representative
> sample, bootstrap p values are garbage.
That would only happen if your dataset was exactly 1717986918 elements
in size. (And in fact, it will be less extreme than I posted: I had x
set to 1717986918.4, as described in another thread. If you use an
integer value you need a different pattern; add or subtract an element
or two and the pattern needed to see a problem changes drastically.)
But if you're sampling from a dataset of that exact size, then you
should worry about this bug. Don't use sample(). Use the algorithm that
Carl described.
Duncan Murdoch
>
> David
>
> On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>> wrote:
>
> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> > Hi Duncan--
> >
> >
>
> That is a mathematically true statement, but I suspect it is not very
> relevant. Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under
> the
> true distribution. Remember Von Neumann's "state of sin" quote. The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
>
> The practical question is whether such a function is likely to arise in
> practice or not.
>
> > Whether those correspond to commonly used statistics or not, I
> have no
> > idea.
>
> I am pretty confident that this bug rarely matters.
>
> > Regarding backwards compatibility: as a user, I'd rather the default
> > sample() do the best possible thing, and take an extra step to use
> > something like sample(..., legacy=TRUE) if I want to reproduce
> old results.
>
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed. Then the rejection sampling approach
> would need to be quite a bit more complicated.
>
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample(). But as I said in
> my original post, a contribution of a function without this bug
> would be
> a nice addition.
>
> Duncan Murdoch
>
> >
> > Regards,
> > Philip
> >
> > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
> <murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>> wrote:
> >
> > On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> > > No, the 2nd call only happens when m > 2**31. Here's the code:
> >
> > Yes, you're right. Sorry!
> >
> > So the ratio really does come close to 2. However, the
> difference in
> > probabilities between outcomes is still at most 2^-32 when m
> is less
> > than that cutoff. That's not feasible to detect; the only
> detectable
> > difference would happen if some event was constructed to hold an
> > abundance of outcomes with especially low (or especially high)
> > probability.
> >
> > As I said in my original post, it's probably not hard to
> construct such
> > a thing, but as I've said more recently, it probably wouldn't
> happen by
> > chance. Here's one attempt to do it:
> >
> > Call the values from unif_rand() "the unif_rand() outcomes".
> Call the
> > values from sample() the sample outcomes.
> >
> > It would be easiest to see the error if half of the sample()
> outcomes
> > used two unif_rand() outcomes, and half used just one. That
> would mean
> > m should be (2/3) * 2^32, but that's too big and would
> trigger the
> > other
> > version.
> >
> > So how about half use 2 unif_rands(), and half use 3? That
> means m =
> > (2/5) * 2^32 = 1717986918. A good guess is that sample()
> outcomes
> > would
> > alternate between the two possibilities, so our event could
> be even
> > versus odd outcomes.
> >
> > Let's try it:
> >
> > > m <- (2/5)*2^32
> > > m > 2^31
> > [1] FALSE
> > > x <- sample(m, 1000000, replace = TRUE)
> > > table(x %% 2)
> >
> > 0 1
> > 399850 600150
> >
> > Since m is an even number, the true proportions of evens and odds
> > should
> > be exactly 0.5. That's some pretty strong evidence of the
> bug in the
> > generator. (Note that the ratio of the observed
> probabilities is about
> > 1.5, so I may not be the first person to have done this.)
> >
> > I'm still not convinced that there has ever been a simulation
> run with
> > detectable bias compared to Monte Carlo error unless it (like
> this one)
> > was designed specifically to show the problem.
> >
> > Duncan Murdoch
> >
> > >
> > > (RNG.c, lines 793ff)
> > >
> > > double R_unif_index(double dn)
> > > {
> > > double cut = INT_MAX;
> > >
> > > switch(RNG_kind) {
> > > case KNUTH_TAOCP:
> > > case USER_UNIF:
> > > case KNUTH_TAOCP2:
> > > cut = 33554431.0; /* 2^25 - 1 */
> > > break;
> > > default:
> > > break;
> > > }
> > >
> > > double u = dn > cut ? ru() : unif_rand();
> > > return floor(dn * u);
> > > }
> > >
> > > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> > <murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>
> <mailto:murdoch.duncan using gmail.com <mailto:murdoch.duncan using gmail.com>>
> > > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>>> wrote:
> > >
> > > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > > > The 53 bits only encode at most 2^{32} possible values,
> > because the
> > > > source of the float is the output of a 32-bit PRNG (the
> > obsolete
> > > version
> > > > of MT). 53 bits isn't the relevant number here.
> > >
> > > No, two calls to unif_rand() are used. There are two
> 32 bit
> > values,
> > > but
> > > some of the bits are thrown away.
> > >
> > > Duncan Murdoch
> > >
> > > >
> > > > The selection ratios can get close to 2. Computer
> scientists
> > > don't do it
> > > > the way R does, for a reason.
> > > >
> > > > Regards,
> > > > Philip
> > > >
> > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> > > <murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com> <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com> <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>>
> > > > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>
> > > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>>>> wrote:
> > > >
> > > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > > > > El mié., 19 sept. 2018 a las 14:43, Duncan
> Murdoch
> > > > > (<murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>
> > > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>> <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>
> > > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>
> > <mailto:murdoch.duncan using gmail.com
> <mailto:murdoch.duncan using gmail.com>>>>>)
> > > > escribió:
> > > > >>
> > > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > > > >>> Dear list,
> > > > >>>
> > > > >>> It looks to me that R samples random integers
> > using an
> > > > intuitive but biased
> > > > >>> algorithm by going from a random number on
> [0,1) from
> > > the PRNG
> > > > to a random
> > > > >>> integer, e.g.
> > > > >>>
> > > >
> > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> > > > >>>
> > > > >>> Many other languages use various rejection
> sampling
> > > approaches
> > > > which
> > > > >>> provide an unbiased method for sampling,
> such as
> > in Go,
> > > python,
> > > > and others
> > > > >>> described here:
> https://arxiv.org/abs/1805.10941 (I
> > > believe the
> > > > biased
> > > > >>> algorithm currently used in R is also
> described
> > there). I'm
> > > > not an expert
> > > > >>> in this area, but does it make sense for
> the R to
> > adopt
> > > one of
> > > > the unbiased
> > > > >>> random sample algorithms outlined there
> and used
> > in other
> > > > languages? Would
> > > > >>> a patch providing such an algorithm be
> welcome? What
> > > concerns
> > > > would need to
> > > > >>> be addressed first?
> > > > >>>
> > > > >>> I believe this issue was also raised by
> Killie &
> > Philip in
> > > > >>>
> > >
> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> > > > more
> > > > >>> recently in
> > > > >>>
> > > >
> > >
> >
> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
> <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
> >
> <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
> > >
> >
> <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
> > > >
> > >
> >
> <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
> > > > >>> pointing to the python implementation for
> comparison:
> > > > >>>
> > > >
> > >
> >
> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> > > > >>
> > > > >> I think the analyses are correct, but I
> doubt if a
> > change
> > > to the
> > > > default
> > > > >> is likely to be accepted as it would make
> it more
> > > difficult to
> > > > reproduce
> > > > >> older results.
> > > > >>
> > > > >> On the other hand, a contribution of a new
> > function like
> > > > sample() but
> > > > >> not suffering from the bias would be good. The
> > normal way to
> > > > make such
> > > > >> a contribution is in a user contributed
> package.
> > > > >>
> > > > >> By the way, R code illustrating the bias is
> > probably not very
> > > > hard to
> > > > >> put together. I believe the bias manifests
> itself in
> > > sample()
> > > > producing
> > > > >> values with two different probabilities
> (instead
> > of all equal
> > > > >> probabilities). Those may differ by as much as
> > one part in
> > > > 2^32. It's
> > > > >
> > > > > According to Kellie and Philip, in the
> attachment
> > of the
> > > thread
> > > > > referenced by Carl, "The maximum ratio of
> selection
> > > probabilities can
> > > > > get as large as 1.5 if n is just below 2^31".
> > > >
> > > > Sorry, I didn't write very well. I meant to
> say that the
> > > difference in
> > > > probabilities would be 2^-32, not that the ratio of
> > > probabilities would
> > > > be 1 + 2^-32.
> > > >
> > > > By the way, I don't see the statement giving
> the ratio as
> > > 1.5, but
> > > > maybe
> > > > I was looking in the wrong place. In Theorem 1
> of the
> > paper
> > > I was
> > > > looking in the ratio was "1 + m 2^{-w + 1}".
> In that
> > formula
> > > m is your
> > > > n. If it is near 2^31, R uses w = 57 random
> bits, so
> > the ratio
> > > > would be
> > > > very, very small (one part in 2^25).
> > > >
> > > > The worst case for R would happen when m is
> just below
> > > 2^25, where w
> > > > is at least 31 for the default generators. In that
> > case the
> > > ratio
> > > > could
> > > > be about 1.03.
> > > >
> > > > Duncan Murdoch
> > > >
> > > >
> > > >
> > > > --
> > > > Philip B. Stark | Associate Dean, Mathematical and
> Physical
> > > Sciences |
> > > > Professor, Department of Statistics |
> > > > University of California
> > > > Berkeley, CA 94720-3860 | 510-394-5077 |
> > > statistics.berkeley.edu/~stark
> <http://statistics.berkeley.edu/%7Estark>
> > <http://statistics.berkeley.edu/%7Estark>
> > > <http://statistics.berkeley.edu/%7Estark>
> > > > <http://statistics.berkeley.edu/%7Estark> |
> > > > @philipbstark
> > > >
> > >
> > >
> > >
> > > --
> > > Philip B. Stark | Associate Dean, Mathematical and Physical
> > Sciences |
> > > Professor, Department of Statistics |
> > > University of California
> > > Berkeley, CA 94720-3860 | 510-394-5077 |
> > statistics.berkeley.edu/~stark
> <http://statistics.berkeley.edu/%7Estark>
> > <http://statistics.berkeley.edu/%7Estark>
> > > <http://statistics.berkeley.edu/%7Estark> |
> > > @philipbstark
> > >
> >
> >
> >
> > --
> > Philip B. Stark | Associate Dean, Mathematical and Physical
> Sciences |
> > Professor, Department of Statistics |
> > University of California
> > Berkeley, CA 94720-3860 | 510-394-5077 |
> statistics.berkeley.edu/~stark
> <http://statistics.berkeley.edu/%7Estark>
> > <http://statistics.berkeley.edu/%7Estark> |
> > @philipbstark
> >
>
> ______________________________________________
> R-devel using r-project.org <mailto:R-devel using r-project.org> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Sent from Gmail Mobile
More information about the R-devel
mailing list