[Rd] [R] RNG Cycle and Duplication (PR#12540)

Duncan Murdoch murdoch at stats.uwo.ca
Fri Aug 15 07:58:29 CEST 2008


shli at stat.wvu.edu wrote:
>   This message is in MIME format.  The first part should be readable text,
>   while the remaining parts are likely unreadable without MIME-aware tools.
>
> ---559023410-851401618-1218751024=:15885
> Content-Type: TEXT/PLAIN; charset=ISO-8859-1; format=flowed
> Content-Transfer-Encoding: QUOTED-PRINTABLE
>
>
> I didn't describe the problem clearly. It's about the number of distinct=20
> values. So just ignore cycle issue.
>
> My tests were:
>
> RNGkind(kind=3D"Knuth-TAOCP");
> sum(duplicated(runif(1e7))); #return 46552
>
> RNGkind(kind=3D"Knuth-TAOCP-2002");
> sum(duplicated(runif(1e7))); #return 46415
>
> #These collision frequency suggested there were 2^30 distinct values by=20
> birthday problem.
>   

The birthday problem distribution applies to independent draws, but they 
are only pseudo-independent.  I think the only ways to know for sure if 
there are 2^30 values are to look at the code, or run through a complete 
cycle.  And you need to determine the cycle by looking at .Random.seed, 
not at the returned value.
>
> RNGkind(kind=3D"Marsaglia-Multicarry");
> sum(duplicated(runif(1e7))); #return 11682
>
> RNGkind(kind=3D"Super-Duper");
> sum(duplicated(runif(1e7))); #return 11542
>
> RNGkind(kind=3D"Mersenne-Twister");
> sum(duplicated(runif(1e7))); #return 11656
>
> #These indicated there were 2^32 distinct values, which agrees with the=20
> help info.
>
>   

If there are 2^30 distinct values for the two generators above, that 
also agrees with the documentation.

> RNGkind(kind=3D"Wichmann-Hill");
> sum(duplicated(runif(1e7))); #return 0
>
> #So for this method, there should be more than 2^32 distinct values.
>
> You may not get the exact numbers, but they should be close. So how to=20
> explain above problem?
>   

You haven't demonstrated what you claim, but if you look at the source, 
you'll see that in fact the man page is wrong:  Wichmann-Hill is based 
on 3 integer values, which each take on approximately 15 bits of 
different values.  So Wichmann-Hill could take nearly 2^45 different 
values (actually 30269*30307*30323).

The source is in https://svn.r-project.org/R/trunk/src/main/RNG.c if you 
want to check the others.
> I need generate a large sample without any ties, it seems to me=20
> "Wichmann-Hill" is only choice right now.
>   

An alternative would be to construct a new value from two (or more) 
runif() values, but be careful that you don't mess up the distribution 
when you do that.

Duncan Murdoch
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> Shengqiao Li
>
> The Department of Statistics
> PO Box 6330
> West Virginia University
> Morgantown, WV 26506-6330
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>
> On Thu, 14 Aug 2008, Peter Dalgaard wrote:
>
>   
>> Shengqiao Li wrote:
>>     
>>> Hello all,
>>> =20
>>> I am generating large samples of random numbers. The RNG help page says:=
>>>       
> =20
>   
>>> "All the supplied uniform generators return 32-bit integer values that a=
>>>       
> re=20
>   
>>> converted to doubles, so they take at most 2^32 distinct values and long=
>>>       
> =20
>   
>>> runs will return duplicated values." But I find that the cycles are not =
>>>       
> the=20
>   
>>> same as the 32-bit integer.
>>> =20
>>> My test indicated that the cycles for Knuth's methods were 2^30 while=20
>>> Wichmann-Hill's cycle was larger than 2^32! No numbers were duplicated i=
>>>       
> n=20
>   
>>> 10M numbers generated by runif using Wichmann-Hill. The other three meth=
>>>       
> ods=20
>   
>>> had cycle length of 2^32.
>>> =20
>>> So, anybody can explain this? And any improvement to the implementation =
>>>       
> can=20
>   
>>> be made to increase the cycle length like the Wichmann-Hill method?
>>> =20
>>>       
>> What test? These are not simple linear congruential generators. Just beca=
>>     
> use=20
>   
>> you get the same value twice, it doesn't mean that the sequence is repeat=
>>     
> ing.=20
>   
>> Perhaps you should read the entire help page rather than just the note.
>>
>> --=20
>>  O__  ---- Peter Dalgaard             =D8ster Farimagsgade 5, Entr.B
>> c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>> (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>>
>>
>>     
> ---559023410-851401618-1218751024=:15885--
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



More information about the R-devel mailing list