[Rd] Bias in R's random integers?
Carl Boettiger
cboettig at gmail.com
Thu Sep 20 01:43:38 CEST 2018
For a well-tested C algorithm, based on my reading of Lemire, the unbiased
"algorithm 3" in https://arxiv.org/abs/1805.10941 is already part of the C
standard library in OpenBSD and macOS (as arc4random_uniform), and in the
GNU standard library. Lemire also provides C++ code in the appendix of his
paper for both this and the faster "nearly divisionless" algorithm.
It would be excellent if any R core members were interested in considering
bindings to these algorithms as a patch, or might express expectations for
how that patch would have to operate (e.g. re Duncan's comment about
non-integer arguments to sample size). Otherwise, an R package binding
seems like a good starting point, but I'm not the right volunteer.
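
To make this concrete, here is a minimal sketch in C of the
multiply-then-reject idea behind the "nearly divisionless" method (an
illustration only, not a proposed patch and not code taken from R or from
Lemire's appendix; the xorshift32 generator is just a stand-in for whatever
32-bit source a real binding would use):

#include <stdio.h>
#include <stdint.h>

/* Stand-in 32-bit generator (xorshift32), for illustration only; a real
 * binding would draw from R's generator or from arc4random instead. */
static uint32_t rng_state = 2463534242u;
static uint32_t rand32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}

/* Unbiased draw from {0, ..., s-1} (s > 0), in the style of Lemire's
 * "nearly divisionless" method: take the high 32 bits of rand32() * s,
 * rejecting just enough of the lowest products that every one of the s
 * outcomes is hit by exactly floor(2^32 / s) values of rand32(). */
uint32_t bounded_rand(uint32_t s)
{
    uint64_t m = (uint64_t)rand32() * (uint64_t)s;
    uint32_t l = (uint32_t)m;                      /* low 32 bits of the product */
    if (l < s) {                                   /* possibly in the biased region */
        uint32_t t = (uint32_t)((1ULL << 32) % s); /* 2^32 mod s */
        while (l < t) {                            /* reject and redraw */
            m = (uint64_t)rand32() * (uint64_t)s;
            l = (uint32_t)m;
        }
    }
    return (uint32_t)(m >> 32);                    /* high 32 bits: unbiased result */
}

int main(void)
{
    for (int i = 0; i < 10; i++)        /* example: ten draws from {0, ..., 9} */
        printf("%u ", bounded_rand(10));
    printf("\n");
    return 0;
}

The division is reached only when the low half of the product is below s,
i.e. with probability s / 2^32, which is why this variant is faster than the
classical arc4random_uniform-style rejection loop while still giving exactly
uniform draws.
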
On Wed, Sep 19, 2018 at 4:22 PM Ben Bolker <bbolker at gmail.com> wrote:
>
> A quick point of order here: arguing with Duncan in this forum is
> helpful to expose ideas, but probably neither side will convince the
> other; eventually, if you want this adopted in core R, you'll need to
> convince an R-core member to pursue this fix.
>
> In the meantime, a good, well-tested implementation in a
> user-contributed package (presumably written in C for speed) would be
> enormously useful. Volunteers ... ?
>
>
>
> On 2018-09-19 04:19 PM, Duncan Murdoch wrote:
> > On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> >> Hi Duncan--
> >>
> >> Nice simulation!
> >>
> >> The absolute difference in probabilities is small, but the maximum
> >> relative difference grows from something negligible to almost 2 as m
> >> approaches 2**31.
> >>
> >> Because the L_1 distance between the uniform distribution on {1, ...,
> >> m} and what you actually get is large, there have to be test functions
> >> whose expectations are quite different under the two distributions.
> >
> > That is a mathematically true statement, but I suspect it is not very
> > relevant. Pseudo-random number generators always have test functions
> > whose sample averages are quite different from the expectation under the
> > true distribution. Remember Von Neumann's "state of sin" quote. The
> > bug in sample() just means it is easier to find such a function than it
> > would otherwise be.
> >
> > The practical question is whether such a function is likely to arise in
> > practice or not.
> >
> >> Whether those correspond to commonly used statistics or not, I have no
> >> idea.
> >
> > I am pretty confident that this bug rarely matters.
> >
> >> Regarding backwards compatibility: as a user, I'd rather the default
> >> sample() do the best possible thing, and take an extra step to use
> >> something like sample(..., legacy=TRUE) if I want to reproduce old
> >> results.
> >
> > I suspect there's a good chance the bug I discovered today (non-integer
> > x values not being truncated) will be declared to be a feature, and the
> > documentation will be changed. Then the rejection sampling approach
> > would need to be quite a bit more complicated.
> >
> > I think a documentation warning about the accuracy of sampling
> > probabilities would also be a sufficient fix here, and would be quite a
> > bit less trouble than changing the default sample(). But as I said in
> > my original post, a contribution of a function without this bug would be
> > a nice addition.
> >
> > Duncan Murdoch
> >
> >>
> >> Regards,
> >> Philip
> >>
> >> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
> >> <murdoch.duncan at gmail.com> wrote:
> >>
> >> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> >> > No, the 2nd call only happens when m > 2**31. Here's the code:
> >>
> >> Yes, you're right. Sorry!
> >>
> >> So the ratio really does come close to 2. However, the difference in
> >> probabilities between outcomes is still at most 2^-32 when m is less
> >> than that cutoff. That's not feasible to detect; the only detectable
> >> difference would happen if some event was constructed to hold an
> >> abundance of outcomes with especially low (or especially high)
> >> probability.
> >>
> >> As I said in my original post, it's probably not hard to construct such
> >> a thing, but as I've said more recently, it probably wouldn't happen by
> >> chance. Here's one attempt to do it:
> >>
> >> Call the values from unif_rand() "the unif_rand() outcomes". Call the
> >> values from sample() the sample outcomes.
> >>
> >> It would be easiest to see the error if half of the sample() outcomes
> >> used two unif_rand() outcomes, and half used just one. That would mean
> >> m should be (2/3) * 2^32, but that's too big and would trigger the other
> >> version.
> >>
> >> So how about half use 2 unif_rands(), and half use 3? That means m =
> >> (2/5) * 2^32 = 1717986918. A good guess is that sample() outcomes would
> >> alternate between the two possibilities, so our event could be even
> >> versus odd outcomes.
> >>
> >> Let's try it:
> >>
> >> > m <- (2/5)*2^32
> >> > m > 2^31
> >> [1] FALSE
> >> > x <- sample(m, 1000000, replace = TRUE)
> >> > table(x %% 2)
> >>
> >> 0 1
> >> 399850 600150
> >>
> >> Since m is an even number, the true proportions of evens and odds should
> >> be exactly 0.5. That's some pretty strong evidence of the bug in the
> >> generator. (Note that the ratio of the observed probabilities is about
> >> 1.5, so I may not be the first person to have done this.)
> >>
> >> I'm still not convinced that there has ever been a simulation run with
> >> detectable bias compared to Monte Carlo error unless it (like this one)
> >> was designed specifically to show the problem.
> >>
> >> Duncan Murdoch
> >>
> >> >
> >> > (RNG.c, lines 793ff)
> >> >
> >> > double R_unif_index(double dn)
> >> > {
> >> >     double cut = INT_MAX;
> >> >
> >> >     switch(RNG_kind) {
> >> >     case KNUTH_TAOCP:
> >> >     case USER_UNIF:
> >> >     case KNUTH_TAOCP2:
> >> >         cut = 33554431.0; /* 2^25 - 1 */
> >> >         break;
> >> >     default:
> >> >         break;
> >> >     }
> >> >
> >> >     double u = dn > cut ? ru() : unif_rand();
> >> >     return floor(dn * u);
> >> > }
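
[Editor's aside on the quoted code: the bias of the floor(dn * u) mapping is
easy to see with a toy version of the same calculation. The small C program
below (an illustration, not R code; the grid size 2^8 and the number of
outcomes 6 are arbitrary) pushes every point of an equally spaced grid of u
values through floor(m * u) and counts how often each outcome is hit.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void)
{
    const int w = 8;                /* toy word size; R effectively has 32 (or 57) bits */
    const uint32_t N = 1u << w;     /* number of distinct grid values of u */
    const int m = 6;                /* number of outcomes; does not divide N */
    int count[6] = {0};

    for (uint32_t i = 0; i < N; i++) {
        double u = (double)i / (double)N;  /* u uniform on the grid {0, 1/N, ...} */
        count[(int)floor(m * u)]++;        /* the same mapping as floor(dn * u) */
    }
    /* Each outcome is hit either floor(N/m) = 42 or ceil(N/m) = 43 times,
     * so the outcome probabilities differ by 1/N. */
    for (int k = 0; k < m; k++)
        printf("outcome %d: %d of %u\n", k, count[k], (unsigned)N);
    return 0;
}

With u effectively on a 2^32-point grid and m = (2/5) * 2^32, as in Duncan's
sample() demonstration above, each outcome is hit by either 2 or 3 grid
points, which is where the roughly 2:3 split between even and odd values (a
ratio of about 1.5) comes from.]
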
> >> >
> >> > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> >> > <murdoch.duncan at gmail.com> wrote:
> >> >
> >> > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> >> > > The 53 bits only encode at most 2^{32} possible values, because the
> >> > > source of the float is the output of a 32-bit PRNG (the obsolete
> >> > > version of MT). 53 bits isn't the relevant number here.
> >> >
> >> > No, two calls to unif_rand() are used. There are two 32-bit values,
> >> > but some of the bits are thrown away.
> >> >
> >> > Duncan Murdoch
> >> >
> >> > >
> >> > > The selection ratios can get close to 2. Computer scientists don't
> >> > > do it the way R does, for a reason.
> >> > >
> >> > > Regards,
> >> > > Philip
> >> > >
> >> > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> >> > > <murdoch.duncan at gmail.com> wrote:
> >> > >
> >> > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> >> > > > On Wed, 19 Sep 2018 at 14:43, Duncan Murdoch
> >> > > > (<murdoch.duncan at gmail.com>) wrote:
> >> > > >>
> >> > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> >> > > >>> Dear list,
> >> > > >>>
> >> > > >>> It looks to me that R samples random integers using an
> >> > > >>> intuitive but biased algorithm by going from a random number on
> >> > > >>> [0,1) from the PRNG to a random integer, e.g.
> >> > > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> >> > > >>>
> >> > > >>> Many other languages use various rejection sampling approaches
> >> > > >>> which provide an unbiased method for sampling, such as in Go,
> >> > > >>> Python, and others described here:
> >> > > >>> https://arxiv.org/abs/1805.10941 (I believe the biased
> >> > > >>> algorithm currently used in R is also described there). I'm
> >> > > >>> not an expert in this area, but does it make sense for R to
> >> > > >>> adopt one of the unbiased random sample algorithms outlined
> >> > > >>> there and used in other languages? Would a patch providing such
> >> > > >>> an algorithm be welcome? What concerns would need to be
> >> > > >>> addressed first?
> >> > > >>>
> >> > > >>> I believe this issue was also raised by Kellie & Philip in
> >> > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> >> > > >>> more recently in
> >> > > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> >> > > >>> pointing to the Python implementation for comparison:
> >> > > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> >> > > >>
> >> > > >> I think the analyses are correct, but I doubt if a change to the
> >> > > >> default is likely to be accepted as it would make it more
> >> > > >> difficult to reproduce older results.
> >> > > >>
> >> > > >> On the other hand, a contribution of a new function like sample()
> >> > > >> but not suffering from the bias would be good. The normal way to
> >> > > >> make such a contribution is in a user contributed package.
> >> > > >>
> >> > > >> By the way, R code illustrating the bias is probably not very
> >> > > >> hard to put together. I believe the bias manifests itself in
> >> > > >> sample() producing values with two different probabilities
> >> > > >> (instead of all equal probabilities). Those may differ by as much
> >> > > >> as one part in 2^32. It's
> >> > > >
> >> > > > According to Kellie and Philip, in the attachment of the thread
> >> > > > referenced by Carl, "The maximum ratio of selection probabilities
> >> > > > can get as large as 1.5 if n is just below 2^31".
> >> > >
> >> > > Sorry, I didn't write very well. I meant to say that the difference
> >> > > in probabilities would be 2^-32, not that the ratio of probabilities
> >> > > would be 1 + 2^-32.
> >> > >
> >> > > By the way, I don't see the statement giving the ratio as 1.5, but
> >> > > maybe I was looking in the wrong place. In Theorem 1 of the paper I
> >> > > was looking at, the ratio was "1 + m 2^{-w + 1}". In that formula m
> >> > > is your n. If it is near 2^31, R uses w = 57 random bits, so the
> >> > > ratio would exceed 1 by only about one part in 2^25.
> >> > >
> >> > > The worst case for R would happen when m is just below 2^25, where w
> >> > > is at least 31 for the default generators. In that case the ratio
> >> > > could be about 1.03.
> >> > >
> >> > > Duncan Murdoch
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> >> > > Professor, Department of Statistics |
> >> > > University of California
> >> > > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> >> > > @philipbstark
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> >> > Professor, Department of Statistics |
> >> > University of California
> >> > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> >> > @philipbstark
> >> >
> >>
> >>
> >>
> >> --
> >> Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> >> Professor, Department of Statistics |
> >> University of California
> >> Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> >> @philipbstark
> >>
> >
--
http://carlboettiger.info