[Rd] Bias in R's random integers?
Ben Bolker
bbolker at gmail.com
Thu Sep 20 01:17:30 CEST 2018
A quick point of order here: arguing with Duncan in this forum is
helpful to expose ideas, but probably neither side will convince the
other; eventually, if you want this adopted in core R, you'll need to
convince an R-core member to pursue this fix.
In the meantime, a good, well-tested implementation in a
user-contributed package (presumably written in C for speed) would be
enormously useful. Volunteers ... ?
On 2018-09-19 04:19 PM, Duncan Murdoch wrote:
> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
>> Hi Duncan--
>>
>> Nice simulation!
>>
>> The absolute difference in probabilities is small, but the maximum
>> relative difference grows from something negligible to almost 2 as m
>> approaches 2**31.
>>
>> Because the L_1 distance between the uniform distribution on {1, ...,
>> m} and what you actually get is large, there have to be test functions
>> whose expectations are quite different under the two distributions.
>
> That is a mathematically true statement, but I suspect it is not very
> relevant. Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution. Remember von Neumann's "state of sin" quote. The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
>
> The practical question is whether such a function is likely to arise in
> practice or not.
>
>> Whether those correspond to commonly used statistics or not, I have no
>> idea.
>
> I am pretty confident that this bug rarely matters.
>
>> Regarding backwards compatibility: as a user, I'd rather the default
>> sample() do the best possible thing, and take an extra step to use
>> something like sample(..., legacy=TRUE) if I want to reproduce old
>> results.
>
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed. Then the rejection sampling approach
> would need to be quite a bit more complicated.
>
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample(). But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
>
> Duncan Murdoch
>
>>
>> Regards,
>> Philip
>>
>> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>>
>> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>> > No, the 2nd call only happens when m > 2**31. Here's the code:
>>
>> Yes, you're right. Sorry!
>>
>> So the ratio really does come close to 2. However, the difference in
>> probabilities between outcomes is still at most 2^-32 when m is less
>> than that cutoff. That's not feasible to detect; the only detectable
>> difference would happen if some event was constructed to hold an
>> abundance of outcomes with especially low (or especially high)
>> probability.
>>
>> As I said in my original post, it's probably not hard to construct such
>> a thing, but as I've said more recently, it probably wouldn't happen by
>> chance. Here's one attempt to do it:
>>
>> Call the values from unif_rand() "the unif_rand() outcomes". Call the
>> values from sample() the sample outcomes.
>>
>> It would be easiest to see the error if half of the sample() outcomes
>> used two unif_rand() outcomes, and half used just one. That would mean
>> m should be (2/3) * 2^32, but that's too big and would trigger the
>> other version.
>>
>> So how about half use 2 unif_rands(), and half use 3? That means m =
>> (2/5) * 2^32 = 1717986918. A good guess is that sample() outcomes
>> would alternate between the two possibilities, so our event could be
>> even versus odd outcomes.
>>
>> Let's try it:
>>
>> > m <- (2/5)*2^32
>> > m > 2^31
>> [1] FALSE
>> > x <- sample(m, 1000000, replace = TRUE)
>> > table(x %% 2)
>>
>>      0      1
>> 399850 600150
>>
>> Since m is an even number, the true proportions of evens and odds
>> should be exactly 0.5. That's some pretty strong evidence of the bug
>> in the generator. (Note that the ratio of the observed probabilities
>> is about 1.5, so I may not be the first person to have done this.)
>>
>> I'm still not convinced that there has ever been a simulation run with
>> detectable bias compared to Monte Carlo error unless it (like this
>> one) was designed specifically to show the problem.
>>
>> Duncan Murdoch
>>
>> >
>> > (RNG.c, lines 793ff)
>> >
>> > double R_unif_index(double dn)
>> > {
>> >     double cut = INT_MAX;
>> >
>> >     switch(RNG_kind) {
>> >     case KNUTH_TAOCP:
>> >     case USER_UNIF:
>> >     case KNUTH_TAOCP2:
>> >         cut = 33554431.0; /* 2^25 - 1 */
>> >         break;
>> >     default:
>> >         break;
>> >     }
>> >
>> >     double u = dn > cut ? ru() : unif_rand();
>> >     return floor(dn * u);
>> > }
>> >
>> > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
>> > <murdoch.duncan at gmail.com> wrote:
>> >
>> > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>> > > The 53 bits only encode at most 2^{32} possible values, because the
>> > > source of the float is the output of a 32-bit PRNG (the obsolete
>> > > version of MT). 53 bits isn't the relevant number here.
>> >
>> > No, two calls to unif_rand() are used. There are two 32 bit values,
>> > but some of the bits are thrown away.
>> >
>> > Duncan Murdoch
>> >
>> > >
>> > > The selection ratios can get close to 2. Computer scientists don't
>> > > do it the way R does, for a reason.
>> > >
>> > > Regards,
>> > > Philip
>> > >
>> > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>> > > <murdoch.duncan at gmail.com> wrote:
>> > >
>> > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>> > > > On Wed, 19 Sept 2018 at 14:43, Duncan Murdoch
>> > > > (<murdoch.duncan at gmail.com>) wrote:
>> > > >>
>> > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>> > > >>> Dear list,
>> > > >>>
>> > > >>> It looks to me that R samples random integers using an intuitive
>> > > >>> but biased algorithm by going from a random number on [0,1) from
>> > > >>> the PRNG to a random integer, e.g.
>> > > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>> > > >>>
>> > > >>> Many other languages use various rejection sampling approaches
>> > > >>> which provide an unbiased method for sampling, such as in Go,
>> > > >>> python, and others described here: https://arxiv.org/abs/1805.10941
>> > > >>> (I believe the biased algorithm currently used in R is also
>> > > >>> described there). I'm not an expert in this area, but does it make
>> > > >>> sense for R to adopt one of the unbiased random sample algorithms
>> > > >>> outlined there and used in other languages? Would a patch providing
>> > > >>> such an algorithm be welcome? What concerns would need to be
>> > > >>> addressed first?
>> > > >>>
>> > > >>> I believe this issue was also raised by Kellie & Philip in
>> > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
>> > > >>> more recently in
>> > > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
>> > > >>> pointing to the python implementation for comparison:
>> > > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>> > > >>
>> > > >> I think the analyses are correct, but I doubt if a change to the
>> > > >> default is likely to be accepted as it would make it more difficult
>> > > >> to reproduce older results.
>> > > >>
>> > > >> On the other hand, a contribution of a new function like sample()
>> > > >> but not suffering from the bias would be good. The normal way to
>> > > >> make such a contribution is in a user contributed package.
>> > > >>
>> > > >> By the way, R code illustrating the bias is probably not very hard
>> > > >> to put together. I believe the bias manifests itself in sample()
>> > > >> producing values with two different probabilities (instead of all
>> > > >> equal probabilities). Those may differ by as much as one part in
>> > > >> 2^32. It's
>> > > >
>> > > > According to Kellie and Philip, in the attachment of the thread
>> > > > referenced by Carl, "The maximum ratio of selection probabilities
>> > > > can get as large as 1.5 if n is just below 2^31".
>> > >
>> > > Sorry, I didn't write very well. I meant to say that the difference
>> > > in probabilities would be 2^-32, not that the ratio of probabilities
>> > > would be 1 + 2^-32.
>> > >
>> > > By the way, I don't see the statement giving the ratio as 1.5, but
>> > > maybe I was looking in the wrong place. In Theorem 1 of the paper I
>> > > was looking in, the ratio was "1 + m 2^{-w + 1}". In that formula m
>> > > is your n. If it is near 2^31, R uses w = 57 random bits, so the
>> > > ratio would be very, very small (one part in 2^25).
>> > >
>> > > The worst case for R would happen when m is just below 2^25, where w
>> > > is at least 31 for the default generators. In that case the ratio
>> > > could be about 1.03.
>> > >
>> > > Duncan Murdoch
>> > >
>> > >
>> > >
>> > > --
>> > > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
>> > > Professor, Department of Statistics |
>> > > University of California
>> > > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
>> > > @philipbstark
>> > >
>> >
>> >
>>
>>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel