[Rd] bias issue in sample() (PR 17494)

Tierney, Luke |uke-t|erney @end|ng |rom u|ow@@edu
Tue Feb 26 16:41:51 CET 2019


On Tue, 26 Feb 2019, Kirill Müller wrote:

> Ralf
>
>
> I don't doubt this is expected with the current implementation, I doubt the 
> implementation is desirable. Suggesting to turn this to
>
> pbirthday(1e6, classes = 2^53)
> ## [1] 5.550956e-05

That isn't a small number given simulation sizes people routinely run
these days. Just about right to miss an issue in a pilot run and get
bitten on the real one.

In the inversion generator for normals we already use a higher
resolution uniform produced from two regular ones. I considered
switching to that approach for all uniforms, either in addition to or
instead of changing the uniform integer sampling algorithm used in
sample(). But that would have been even more disruptive:

- all simulation results (except normals) would change;
- there would be a performance penalty;
- the streams would be used up twice as fast;

I would also probably be necessary to rethink things like how to use
the L'Ecuyer generator to produce multiple streams in the `parallel`
package.

We may need to take this route in the future, but it didn't seem like
a good idea at this time.

Best,

luke

>
> (which is still non-zero, but much less likely to cause confusion.)
>
>
> Best regards
>
> Kirill
>
> On 26.02.19 10:18, Ralf Stubner wrote:
>> Kirill,
>> 
>> I think some level of collision is actually expected! R uses a 32bit MT
>> that can produce 2^32 different doubles. The probability for a collision
>> within a million draws is
>> 
>>> pbirthday(1e6, classes = 2^32)
>> [1] 1
>> 
>> Greetings
>> Ralf
>> 
>> 
>> On 26.02.19 07:06, Kirill Müller wrote:
>>> Gabe
>>> 
>>> 
>>> As mentioned on Twitter, I think the following behavior should be fixed
>>> as part of the upcoming changes:
>>> 
>>> R.version.string
>>> ## [1] "R Under development (unstable) (2019-02-25 r76160)"
>>> .Machine$double.digits
>>> ## [1] 53
>>> set.seed(123)
>>> RNGkind()
>>> ## [1] "Mersenne-Twister" "Inversion"        "Rejection"
>>> length(table(runif(1e6)))
>>> ## [1] 999863
>>> 
>>> I don't expect any collisions when using Mersenne-Twister to generate a
>>> million floating point values. I'm not sure what causes this behavior,
>>> but it's documented in ?Random:
>>> 
>>> "Do not rely on randomness of low-order bits from RNGs. Most of the
>>> supplied uniform generators return 32-bit integer values that are
>>> converted to doubles, so they take at most 2^32 distinct values and long
>>> runs will return duplicated values (Wichmann-Hill is the exception, and
>>> all give at least 30 varying bits.)"
>>> 
>>> The "Wichman-Hill" bit is interesting:
>>> 
>>> RNGkind("Wichmann-Hill")
>>> length(table(runif(1e6)))
>>> ## [1] 1000000
>>> length(table(runif(1e6)))
>>> ## [1] 1000000
>>> 
>>> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
>>> it would be great to see the above behavior also for Mersenne-Twister.
>>> Thanks for considering.
>>> 
>>> 
>>> Best regards
>>> 
>>> Kirill
>>> 
>>> 
>>> On 20.02.19 08:01, Gabriel Becker wrote:
>>>> Luke,
>>>> 
>>>> I'm happy to help with this. Its great to see this get tackled (I've
>>>> cc'ed
>>>> Kelli Ottoboni who helped flag this issue).
>>>> 
>>>> I can prepare a patch for the RNGkind related stuff and the doc update.
>>>> 
>>>> As for ???, what are your (and others') thoughts about the possibility of
>>>> a) a reproducibility API which takes either an R version (or maybe
>>>> alternatively a date) and sets the RNGkind to the default for that
>>>> version/date, and/or b) that sessionInfo be modified to capture (and
>>>> display) the RNGkind in effect.
>>>> 
>>>> Best,
>>>> ~G
>>>> 
>>>> 
>>>> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <luke-tierney using uiowa.edu>
>>>> wrote:
>>>> 
>>>>> Before the next release we really should to sort out the bias issue in
>>>>> sample() reported by Ottoboni and Stark in
>>>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
>>>>> filed aa a bug report by Duncan Murdoch at
>>>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>>>>> 
>>>>> Here are two examples of bad behavior through current R-devel:
>>>>>
>>>>>        set.seed(123)
>>>>>        m <- (2/5) * 2^32
>>>>>        x <- sample(m, 1000000, replace = TRUE)
>>>>>        table(x %% 2, x > m / 2)
>>>>>        ##
>>>>>        ##    FALSE   TRUE
>>>>>        ## 0 300620 198792
>>>>>        ## 1 200196 300392
>>>>>
>>>>>        table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>>>>>        ##
>>>>>        ##      0      1
>>>>>        ## 429054 570946
>>>>> 
>>>>> I committed a modification to R_unif_index to address this by
>>>>> generating random bits (blocks of 16) and rejection sampling, but for
>>>>> now this is only enabled if the environment variable R_NEW_SAMPLE is
>>>>> set before the first call.
>>>>> 
>>>>> Some things still needed:
>>>>> 
>>>>> - someone to look over the change and see if there are any issues
>>>>> - adjustment of RNGkind to allowing the old behavior to be selected
>>>>> - make the new behavior the default
>>>>> - adjust documentation
>>>>> - ???
>>>>> 
>>>>> Unfortunately I don't have enough free cycles to do this, but I can
>>>>> help if someone else can take the lead.
>>>>> 
>>>>> There are two other places I found that might suffer from the same
>>>>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
>>>>> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
>>>>> have done that for walker_ProbSampleReplace, but the wilcox change
>>>>> might need adjusting to support the standalone math library and I
>>>>> don't feel confident enough I'd get that right.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> luke
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Luke Tierney
>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>> University of Iowa                  Phone:             319-335-3386
>>>>> Department of Statistics and        Fax:               319-335-3017
>>>>>       Actuarial Science
>>>>> 241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
>>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>>> 
>>>>> ______________________________________________
>>>>> R-devel using r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>
>>>>      [[alternative HTML version deleted]]
>>>> 
>>>> ______________________________________________
>>>> R-devel using r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


More information about the R-devel mailing list