[Rd] Bias in R's random integers?
Tierney, Luke
luke-tierney @ending from uiow@@edu
Fri Sep 21 18:38:15 CEST 2018
Not sure what should happen theoretically for the code in vseq.c, but
I see the same pattern with the R generators I tried (default,
Super-Duper, and L'Ecuyer) and with with bash $RANDOM using
N <- 10000
X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE)))
X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE)))
X <- X1 + 2 ^ 15 * (X2 > 2^14)
and with numbers from random.org
library(random)
X <- randomNumbers(N, 0, 2^16-1, col = 1)
So I'm not convinced there is an issue.
Best,
luke
On Fri, 21 Sep 2018, Steve Grubb wrote:
> Hello,
>
> Top posting. Several people have asked about the code to replicate my
> results. I have cleaned up the code to remove an x/y coordinate bias for
> displaying the results directly on a 640 x 480 VGA adapter. You can find the
> code here:
>
> http://people.redhat.com/sgrubb/files/vseq.c
>
> To collect R samples:
> X <- runif(10000, min = 0, max = 65535)
> write.table(X, file = "~/r-rand.txt", sep = "\n", row.names = FALSE)
>
> Then:
> cat ~/r-rand.txt | ./vseq > ~/r-rand.csv
>
> And then to create the chart:
>
> library(ggplot2);
> num.csv <- read.csv("~/random.csv", header=T)
> qplot(X, Y, data=num.csv);
>
> Hope this helps sort this out.
>
> Best Regards,
> -Steve
>
> On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb wrote:
>> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
>>> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
>>>> On 9/20/18 1:43 AM, Carl Boettiger wrote:
>>>>> For a well-tested C algorithm, based on my reading of Lemire, the
>>>>> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part
>>>>> already of the C standard library in OpenBSD and macOS (as
>>>>> arc4random_uniform), and in the GNU standard library. Lemire also
>>>>> provides C++ code in the appendix of his piece for both this and the
>>>>> faster "nearly divisionless" algorithm.
>>>>>
>>>>> It would be excellent if any R core members were interested in
>>>>> considering bindings to these algorithms as a patch, or might express
>>>>> expectations for how that patch would have to operate (e.g. re
>>>>> Duncan's
>>>>> comment about non-integer arguments to sample size). Otherwise, an R
>>>>> package binding seems like a good starting point, but I'm not the
>>>>> right
>>>>> volunteer.
>>>>
>>>> It is difficult to do this in a package, since R does not provide
>>>> access
>>>> to the random bits generated by the RNG. Only a float in (0,1) is
>>>> available via unif_rand().
>>>
>>> I believe it is safe to multiply the unif_rand() value by 2^32, and take
>>> the whole number part as an unsigned 32 bit integer. Depending on the
>>> RNG in use, that will give at least 25 random bits. (The low order bits
>>> are the questionable ones. 25 is just a guess, not a guarantee.)
>>>
>>> However, if one is willing to use an external
>>>
>>>> RNG, it is of course possible. After reading about Lemire's work [1], I
>>>> had planned to integrate such an unbiased sampling scheme into the
>>>> dqrng
>>>> package, which I have now started. [2]
>>>>
>>>> Using Duncan's example, the results look much better:
>>>>> library(dqrng)
>>>>> m <- (2/5)*2^32
>>>>> y <- dqsample(m, 1000000, replace = TRUE)
>>>>> table(y %% 2)
>>>>>
>>>> 0 1
>>>>
>>>> 500252 499748
>>>
>>> Another useful diagnostic is
>>>
>>> plot(density(y[y %% 2 == 0]))
>>>
>>> Obviously that should give a more or less uniform density, but for
>>> values near m, the default sample() gives some nice pretty pictures of
>>> quite non-uniform densities.
>>>
>>> By the way, there are actually quite a few examples of very large m
>>> besides m = (2/5)*2^32 where performance of sample() is noticeably bad.
>>> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a)
>>> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
>>> 3a)*2^32, etc.
>>>
>>> So perhaps I'm starting to be convinced that the default sample() should
>>> be fixed.
>>
>> I find this discussion fascinating. I normally test random numbers in
>> different languages every now and again using various methods. One simple
>> check that I do is to use Michal Zalewski's method when he studied Strange
>> Attractors and Initial TCP/IP Sequence Numbers:
>>
>> http://lcamtuf.coredump.cx/newtcp/
>> https://pdfs.semanticscholar.org/
>> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf
>>
>> The technique works by mapping the dynamics of the generated numbers into a
>> three-dimensional phase space. This is then plotted in a graph so that you
>> can visually see if something odd is going on.
>>
>> I used runif(10000, min = 0, max = 65535) to get a set of numbers. This
>> is the resulting plot that was generated from R's numbers using this
>> technique:
>>
>> http://people.redhat.com/sgrubb/files/r-random.jpg
>>
>> And for comparison this was generated by collecting the same number of
>> samples from the bash shell:
>>
>> http://people.redhat.com/sgrubb/files/bash-random.jpg
>>
>> The net result is that it shows some banding in the R generated random
>> numbers where bash has uniform random numbers with no discernible pattern.
>>
>> Best Regards,
>> -Steve
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa Phone: 319-335-3386
Department of Statistics and Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall email: luke-tierney using uiowa.edu
Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
More information about the R-devel
mailing list