[Rd] Bias in R's random integers?

Thu Sep 20 19:57:00 CEST 2018

Hi,

Note that it wouldn't be the first time that sample() changes behavior
in a non-backward compatible way:

   https://stat.ethz.ch/pipermail/r-devel/2012-October/065049.html

Cheers,
H.

On 09/20/2018 08:15 AM, Duncan Murdoch wrote:
> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
>> On 9/20/18 1:43 AM, Carl Boettiger wrote:
>>> For a well-tested C algorithm, based on my reading of Lemire, the 
>>> unbiased
>>> "algorithm 3" in 
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1805.10941&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=TtofIDsvWasBZGzOl9J0kBQnJMksr2Rg3u1l8CM5-qE&e= 
>>> is part already of the C
>>> standard library in OpenBSD and macOS (as arc4random_uniform), and in 
>>> the
>>> GNU standard library.  Lemire also provides C++ code in the appendix 
>>> of his
>>> piece for both this and the faster "nearly divisionless" algorithm.
>>>
>>> It would be excellent if any R core members were interested in 
>>> considering
>>> bindings to these algorithms as a patch, or might express 
>>> expectations for
>>> how that patch would have to operate (e.g. re Duncan's comment about
>>> non-integer arguments to sample size).  Otherwise, an R package binding
>>> seems like a good starting point, but I'm not the right volunteer.
>> It is difficult to do this in a package, since R does not provide access
>> to the random bits generated by the RNG. Only a float in (0,1) is
>> available via unif_rand(). 
> 
> I believe it is safe to multiply the unif_rand() value by 2^32, and take 
> the whole number part as an unsigned 32 bit integer.  Depending on the 
> RNG in use, that will give at least 25 random bits.  (The low order bits 
> are the questionable ones.  25 is just a guess, not a guarantee.)
> 
> However, if one is willing to use an external
>> RNG, it is of course possible. After reading about Lemire's work [1], I
>> had planned to integrate such an unbiased sampling scheme into the dqrng
>> package, which I have now started. [2]
>>
>> Using Duncan's example, the results look much better:
>>
>>> library(dqrng)
>>> m <- (2/5)*2^32
>>> y <- dqsample(m, 1000000, replace = TRUE)
>>> table(y %% 2)
>>
>>       0      1
>> 500252 499748
> 
> Another useful diagnostic is
> 
>    plot(density(y[y %% 2 == 0]))
> 
> Obviously that should give a more or less uniform density, but for 
> values near m, the default sample() gives some nice pretty pictures of 
> quite non-uniform densities.
> 
> By the way, there are actually quite a few examples of very large m 
> besides m = (2/5)*2^32 where performance of sample() is noticeably bad. 
> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) 
> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + 
> 3a)*2^32, etc.
> 
> So perhaps I'm starting to be convinced that the default sample() should 
> be fixed.
> 
> Duncan Murdoch
> 
> 
>>
>> Currently I am taking the other interpretation of "truncated":
>>
>>> table(dqsample(2.5, 1000000, replace = TRUE))
>>
>>       0      1
>> 499894 500106
>>
>> I will adjust this to whatever is decided for base R.
>>
>>
>> However, there is currently neither long vector nor weighted sampling
>> support. And the performance without replacement is quite bad compared
>> to R's algorithm with hashing.
>>
>> cheerio
>> ralf
>>
>> [1] via 
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pcg-2Drandom.org_posts_bounded-2Drands.html&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=OlX-dzwoOeFlod3Gofa_1TQaZwmjsCH9C9v3lM5Y2rY&e= 
>>
>> [2] 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_daqana_dqrng_tree_feature_sample&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=DNaSqRCy89Hvbg1G0SpyEL0kkr9_RqWXi9pTy75V32M&e= 
>>
>>
>>
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=WOx4NyeYmWxpDG3tBRQ9-_Y3_7YAlKUKOP6gZLs0BrQ&e= 
>>
>>
> 
> ______________________________________________
> R-devel using r-project.org mailing list
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=WOx4NyeYmWxpDG3tBRQ9-_Y3_7YAlKUKOP6gZLs0BrQ&e= 
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319