[Rd] Bias in R's random integers?
Philip B. Stark
stark sending from stat.berkeley.edu
Wed Sep 19 18:23:07 CEST 2018
No, the second call only happens when m > 2^31. Here's the code
(RNG.c, lines 793ff):
double R_unif_index(double dn)
{
    double cut = INT_MAX;

    switch(RNG_kind) {
    case KNUTH_TAOCP:
    case USER_UNIF:
    case KNUTH_TAOCP2:
        cut = 33554431.0; /* 2^25 - 1 */
        break;
    default:
        break;
    }

    double u = dn > cut ? ru() : unif_rand();
    return floor(dn * u);
}
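For contrast, here is a minimal sketch of the rejection-sampling approach used by other languages to avoid the floor(dn * u) bias. The names rng32 and unbiased_index are mine, and the 32-bit source is a rand()-based stand-in for illustration, not R's generator:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in 32-bit source built from rand(); any uniform 32-bit PRNG
 * (e.g. the Mersenne Twister's raw output) would play this role. */
uint32_t rng32(void)
{
    uint32_t hi  = (uint32_t)rand() & 0x7FFFu;  /* 15 bits */
    uint32_t mid = (uint32_t)rand() & 0x7FFFu;  /* 15 bits */
    uint32_t lo  = (uint32_t)rand() & 0x3u;     /*  2 bits */
    return (hi << 17) | (mid << 2) | lo;
}

/* Rejection sampling: 2^32 mod n of the possible 32-bit words would
 * land in an over-represented residue class, so redraw whenever the
 * word falls below that threshold.  Every index in [0, n) is then
 * selected with exactly equal probability. */
uint32_t unbiased_index(uint32_t n)
{
    uint32_t threshold = (uint32_t)(-n) % n;  /* == 2^32 mod n */
    uint32_t r;
    do {
        r = rng32();
    } while (r < threshold);
    return r % n;
}
```

With n = 6, for example, threshold is 2^32 mod 6 = 4, so only 4 of the 2^32 possible words ever trigger a redraw; the cost of exactness is negligible.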
On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.duncan using gmail.com>
wrote:
> On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > The 53 bits only encode at most 2^{32} possible values, because the
> > source of the float is the output of a 32-bit PRNG (the obsolete version
> > of MT). 53 bits isn't the relevant number here.
>
> No, two calls to unif_rand() are used. There are two 32 bit values, but
> some of the bits are thrown away.
>
> Duncan Murdoch
>
> >
> > The selection ratios can get close to 2. Computer scientists don't do it
> > the way R does, for a reason.
> >
> > Regards,
> > Philip
> >
> > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> > <murdoch.duncan using gmail.com> wrote:
> >
> > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > > On Wed., Sep. 19, 2018 at 14:43, Duncan Murdoch
> > > (<murdoch.duncan using gmail.com>) wrote:
> > >>
> > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > >>> Dear list,
> > >>>
> > >>> It looks to me that R samples random integers using an intuitive
> > >>> but biased algorithm by going from a random number on [0,1) from
> > >>> the PRNG to a random integer, e.g.
> > >>>
> >
> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> > >>>
> > >>> Many other languages use various rejection sampling approaches
> > >>> which provide an unbiased method for sampling, such as in Go,
> > >>> Python, and others described here: https://arxiv.org/abs/1805.10941
> > >>> (I believe the biased algorithm currently used in R is also
> > >>> described there). I'm not an expert in this area, but does it make
> > >>> sense for R to adopt one of the unbiased random sample algorithms
> > >>> outlined there and used in other languages? Would a patch providing
> > >>> such an algorithm be welcome? What concerns would need to be
> > >>> addressed first?
> > >>>
> > >>> I believe this issue was also raised by Kellie & Philip in
> > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> > >>> more recently in
> > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> > >>> pointing to the Python implementation for comparison:
> > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> > >>
> > >> I think the analyses are correct, but I doubt if a change to the
> > >> default is likely to be accepted as it would make it more difficult
> > >> to reproduce older results.
> > >>
> > >> On the other hand, a contribution of a new function like sample()
> > >> but not suffering from the bias would be good. The normal way to
> > >> make such a contribution is in a user contributed package.
> > >>
> > >> By the way, R code illustrating the bias is probably not very hard
> > >> to put together. I believe the bias manifests itself in sample()
> > >> producing values with two different probabilities (instead of all
> > >> equal probabilities). Those may differ by as much as one part in
> > >> 2^32. It's
> > >
> > > According to Kellie and Philip, in the attachment of the thread
> > > referenced by Carl, "The maximum ratio of selection probabilities can
> > > get as large as 1.5 if n is just below 2^31".
> >
> > Sorry, I didn't write very well. I meant to say that the difference in
> > probabilities would be 2^-32, not that the ratio of probabilities would
> > be 1 + 2^-32.
> >
> > By the way, I don't see the statement giving the ratio as 1.5, but
> > maybe I was looking in the wrong place. In Theorem 1 of the paper I
> > was looking at, the ratio was "1 + m 2^{-w + 1}". In that formula m
> > is your n. If it is near 2^31, R uses w = 57 random bits, so the
> > ratio would exceed 1 by only a very, very small amount (one part in
> > 2^25).
> >
> > The worst case for R would happen when m is just below 2^25, where
> > w is at least 31 for the default generators. In that case the ratio
> > could be about 1.03.
> >
> > Duncan Murdoch
> >
> >
> >
> > --
> > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> > Professor, Department of Statistics |
> > University of California
> > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> > @philipbstark
> >
>
>
--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor, Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark