[R] Zipf random number generation

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Tue Jan 25 11:20:57 CET 2005


On 25-Jan-05 Weiguang Shi wrote:
> Hi,
> 
> Is there a Zipf-like distribution RNG in R?
> 
> Thanks,
> Weiguang

"Zipf's Law" (as originally formulated in studies of the
frequencies of words in texts) is to the effect that the
relative frequencies with which words occur once, twice,
3 times, ... are in proportion to 1/1, 1/2, 1/3, ... ,
1/n, ... with no limit on n (i.e. the number of different
words each represented n times is proportional to 1/n).

This is "improper" since sum(1/n) = infinity, so does not
define a probability distribution. A respectable analogue
is Fisher's logarithmic distribution p(x) where

  p(x) = ((t^x)/x)/(-log(1-t)), x = 1, 2, 3, ...

where t in (0,1) is the parameter of the distribution.

As t -> 1, p(x+1)/p(x) -> x/(x+1) as in Zipf's Law.

However, I've searched the R site and have found only
one instance of a function directly related to this
distribution, namely logff() in package VGAM, which
is for estimating the parameter t.

So it looks as though there is no direct implementation
of something like "rlogdist".

However, the logarithmic distribution is a limiting
form of the negative binomial distribution (conditional
on a positive value), and there are functions in R for
random sampling from this distribution.

>From my reading of "?rnbinom" in the base package, this
function does not have the flexibitility you would need
for this. But in the MASS package there is the function
rnegbin() which does.

You would need to invoke

  rnegbin(N, mu=..., theta=... )

where the value ... of mu is large and the value ... of
theta is small, and reject cases with value zero. This
of course makes it awkward to generate a sample of given
size.

Alternatively, you can try along the lines of

> x0<-1:10000; t<-0.999; p<-((t^x0)/x0)/(-log(1-t))
> Y<-sample(x0,5000,replace=TRUE,prob=p)
> hist(Y,breaks=100)

While this gives the logarithmic distribution over the
range of x in x0, it is inexact in that it does not
permit values greater than max(x0) to be sampled.

No doubt others can suggest something better than this!

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 25-Jan-05                                       Time: 10:20:57
------------------------------ XFMail ------------------------------




More information about the R-help mailing list