[R] Wichmann-Hill Random Number Generator and the Birthday Problem

Shengqiao Li shli at stat.wvu.edu
Sun Aug 17 20:18:35 CEST 2008


On Sun, 17 Aug 2008, Duncan Murdoch wrote:

> Shengqiao Li wrote:
>> Dear all,
>> 
>> Recently I am generating large random samples (10M) and any duplicated 
>> numbers are not desired.
>> We tried several RNGs in R and found Wichmann-Hill did not produce 
>> duplications.
>> 
>> The duplication problem is the interesting birthday problem. If there are M 
>> possible numbers, randomly draw N numbers from them,
>> the average number of dupilcations D = N(N-1)/2/M.
>> 
>> For Knuth-TAOCP and Knuth-TAOCP-2002, M=2^30, since this modulus is used. D 
>> = 46566.12 for N=10M samples.
>> 
>> For Marsaglia-Multicarry, Super-Duper and Mersene-Twister, M=2^32. D = 
>> 11641.53 for N = 10M samples.
>> 
>> My testing results (see below) agree with above analysis. But for 
>> Wichmann-Hill, it wasn't. Wichmann-Hill's cycle is 6.9536e12 (refer to RNG 
>> help by ?RNG and Whichmann's correction in 1984). Thus M <= 6.9536e12. D 
>>> = 7.19052 when N=1e7 and D>= 179.763 for N=5e7. 
>> But in my tests, duplications were surprisingly not observed.
>> 
>> It seems that Wichmann-Hill has a much larger cycle than the one 
>> documented!
>> 
>> Anybody can solve this puzzle?
>> 
>
> As I told you, look at the code. 
> The cycle length of Wichmann-Hill in R appears to be 2.8e13, and that also 
> appears to be M in your notation.  Your birthday problem calculation does not 
> apply here.
> You could probably get a good approximation to it by rounding the 
> Wichmann-Hill output (e.g. look at round(2^30 * runif()), or maybe more 
> severe rounding).
>
> M is larger than what was previously documented, but Brian Ripley has 
> corrected the documentation.

Wichmann claimed 2.78X10^13 in his 1982 original paper. He made correction 
in 1984, "The period of the generator was incorrectly given as 2.78 X 
10^13. In fact its period is only a quarter of that value: 6.95X10^12." 
In R version 2.7.1, Brian Ripley adopted Wichmann's 1984 correction in RNG 
document. That is the currently documented cycle is 6.95X10^12 not 
2.78X10^13 nor your approxmation 2.8e13.

So, is the cycle claimed in 1982 correct or the one claimed in 1984?

Even if the larger one 2.78e13 claimed in 1982 is correct,  around 45 
duplications were expected for 50M samples. But I got 0.  My testing 
method is based on the example code in the last third line in RNG help.

Shengqiao Li

>
> Duncan Murdoch
>> Regards,
>> 
>> Shengqiao Li
>> 
>> Department of Statistics
>> West Virgina Unversity
>> 
>> ==============Testing===================
>> 
>> 
>>> RNGkind(kind="Knuth-TAOCP");
>>> RNGkind();
>>> 
>> [1] "Knuth-TAOCP" "Inversion"
>> 
>>> sum(duplicated(runif(1e7)));
>>> 
>> [1] 46216
>> 
>>> RNGkind(kind="Knuth-TAOCP-2002");
>>> RNGkind();
>>> 
>> [1] "Knuth-TAOCP-2002" "Inversion"
>> 
>>> sum(duplicated(runif(1e7)));
>>> 
>> [1] 46373
>> 
>> 
>>> RNGkind(kind="Marsaglia-Multicarry");
>>> RNGkind();
>>> 
>> [1] "Marsaglia-Multicarry" "Inversion"
>> 
>>> sum(duplicated(runif(1e7)));
>>> 
>> [1] 11671
>> 
>>> RNGkind(kind="Super-Duper");
>>> RNGkind();
>>> 
>> [1] "Super-Duper" "Inversion"
>> 
>>> sum(duplicated(runif(1e7)));
>>> 
>> [1] 11704
>> 
>>> RNGkind(kind="Mersenne-Twister");
>>> RNGkind();
>>> 
>> [1] "Mersenne-Twister" "Inversion"
>> 
>>> sum(duplicated(runif(1e7)));
>>> 
>> [1] 11508
>> 
>>> RNGkind(kind="Wichmann-Hill");
>>> RNGkind();
>>> 
>> [1] "Wichmann-Hill" "Inversion"
>> 
>>> sum(duplicated(runif(1e7)));
>>> 
>> [1] 0
>> 
>> 
>>> gc()
>>> 
>>            used (Mb) gc trigger  (Mb)  max used   (Mb)
>> Ncells  199157  5.4     646480  17.3  10149018  271.1
>> Vcells 4497151 34.4  108714629 829.5 169585390 1293.9
>> 
>> 
>>> sum(duplicated(runif(5e7)))
>>> 
>> [1] 0
>> 
>> 
>> 
>>> sessionInfo()
>>> 
>> R version 2.7.1 (2008-06-23)
>> i386-pc-mingw32
>> 
>> locale:
>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
>> States.1252;LC_MONETARY=English_United 
>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> loaded via a namespace (and not attached):
>> [1] tools_2.7.1
>> 
>> ==============End of Testing===================
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
>
>



More information about the R-help mailing list