[Rd] [R] RNG Cycle and Duplication (PR#12540)

Duncan Murdoch murdoch at stats.uwo.ca
Sat Aug 16 02:50:42 CEST 2008


On 15/08/2008 10:28 AM, Shengqiao Li wrote:
> Thank you for your reply and for your suggestion. So the note in man page 
> could be more accurate since for an end user, man page should be more 
> helpful and source code is mainly for developers.
> 
> I was also adviced to use Knuth's  double version RANARRAY from
> http://www-cs-faculty.stanford.edu/~knuth/programs.html instead of the 
> integer versions in R. I'm a R user. So why not 
> also include the double verion in R implementation?

You can try it using kind="user-supplied" if you like, but I suspect 
it's the same as "Knuth-TAOCP-2002".

Duncan Murdoch

> 
> Thanks again,
> 
> ========================================
> Shengqiao Li
> 
> Research Associate
> The Department of Statistics
> PO Box 6330
> West Virginia University
> Morgantown, WV 26506-6330
> 
> ========================================
> 
> On Fri, 15 Aug 2008, Duncan Murdoch wrote:
> 
>> shli at stat.wvu.edu wrote:
>>>   This message is in MIME format.  The first part should be readable text,
>>>   while the remaining parts are likely unreadable without MIME-aware tools.
>>>
>>> ---559023410-851401618-1218751024=:15885
>>> Content-Type: TEXT/PLAIN; charset=ISO-8859-1; format=flowed
>>> Content-Transfer-Encoding: QUOTED-PRINTABLE
>>>
>>>
>>> I didn't describe the problem clearly. It's about the number of distinct=20
>>> values. So just ignore cycle issue.
>>>
>>> My tests were:
>>>
>>> RNGkind(kind=3D"Knuth-TAOCP");
>>> sum(duplicated(runif(1e7))); #return 46552
>>>
>>> RNGkind(kind=3D"Knuth-TAOCP-2002");
>>> sum(duplicated(runif(1e7))); #return 46415
>>>
>>> #These collision frequency suggested there were 2^30 distinct values by=20
>>> birthday problem.
>>>
>> The birthday problem distribution applies to independent draws, but they are 
>> only pseudo-independent.  I think the only ways to know for sure if there are 
>> 2^30 values are to look at the code, or run through a complete cycle.  And 
>> you need to determine the cycle by looking at .Random.seed, not at the 
>> returned value.
>>> RNGkind(kind=3D"Marsaglia-Multicarry");
>>> sum(duplicated(runif(1e7))); #return 11682
>>>
>>> RNGkind(kind=3D"Super-Duper");
>>> sum(duplicated(runif(1e7))); #return 11542
>>>
>>> RNGkind(kind=3D"Mersenne-Twister");
>>> sum(duplicated(runif(1e7))); #return 11656
>>>
>>> #These indicated there were 2^32 distinct values, which agrees with the=20
>>> help info.
>>>
>>>
>> If there are 2^30 distinct values for the two generators above, that also 
>> agrees with the documentation.
>>
>>> RNGkind(kind=3D"Wichmann-Hill");
>>> sum(duplicated(runif(1e7))); #return 0
>>>
>>> #So for this method, there should be more than 2^32 distinct values.
>>>
>>> You may not get the exact numbers, but they should be close. So how to=20
>>> explain above problem?
>>>
>> You haven't demonstrated what you claim, but if you look at the source, 
>> you'll see that in fact the man page is wrong:  Wichmann-Hill is based on 3 
>> integer values, which each take on approximately 15 bits of different values. 
>> So Wichmann-Hill could take nearly 2^45 different values (actually 
>> 30269*30307*30323).
>>
>> The source is in https://svn.r-project.org/R/trunk/src/main/RNG.c if you want 
>> to check the others.
>>> I need generate a large sample without any ties, it seems to me=20
>>> "Wichmann-Hill" is only choice right now.
>>>
>> An alternative would be to construct a new value from two (or more) runif() 
>> values, but be careful that you don't mess up the distribution when you do 
>> that.
>>
>> Duncan Murdoch
>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>>> Shengqiao Li
>>>
>>> The Department of Statistics
>>> PO Box 6330
>>> West Virginia University
>>> Morgantown, WV 26506-6330
>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>>>
>>> On Thu, 14 Aug 2008, Peter Dalgaard wrote:
>>>
>>>
>>>> Shengqiao Li wrote:
>>>>
>>>>> Hello all,
>>>>> =20
>>>>> I am generating large samples of random numbers. The RNG help page says:=
>>>>>
>>> =20
>>>
>>>>> "All the supplied uniform generators return 32-bit integer values that a=
>>>>>
>>> re=20
>>>
>>>>> converted to doubles, so they take at most 2^32 distinct values and long=
>>>>>
>>> =20
>>>
>>>>> runs will return duplicated values." But I find that the cycles are not =
>>>>>
>>> the=20
>>>
>>>>> same as the 32-bit integer.
>>>>> =20
>>>>> My test indicated that the cycles for Knuth's methods were 2^30 while=20
>>>>> Wichmann-Hill's cycle was larger than 2^32! No numbers were duplicated i=
>>>>>
>>> n=20
>>>
>>>>> 10M numbers generated by runif using Wichmann-Hill. The other three meth=
>>>>>
>>> ods=20
>>>
>>>>> had cycle length of 2^32.
>>>>> =20
>>>>> So, anybody can explain this? And any improvement to the implementation =
>>>>>
>>> can=20
>>>
>>>>> be made to increase the cycle length like the Wichmann-Hill method?
>>>>> =20
>>>>>
>>>> What test? These are not simple linear congruential generators. Just beca=
>>>>
>>> use=20
>>>
>>>> you get the same value twice, it doesn't mean that the sequence is repeat=
>>>>
>>> ing.=20
>>>
>>>> Perhaps you should read the entire help page rather than just the note.
>>>>
>>>> --=20
>>>>  O__  ---- Peter Dalgaard             =D8ster Farimagsgade 5, Entr.B
>>>> c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>>> (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>>>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>>>>
>>>>
>>>>
>>> ---559023410-851401618-1218751024=:15885--
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list