[R] Zipf random number generation

Tue Jan 25 22:53:47 CET 2005

Thanks very much Ted for the detailed explanation.

It might be flawed but some "common practices" in my
area use the following approach to generate a random
rank for some Zipf-like distribution with the
parameter alpha.

The key is to introduce the limit of ranks, N. With N
and alpha, therefore, the probability of rank i is
determined by 
    c/pow((double) i, alpha)
where
    c = sum(1.0 / pow((double) i, alpha))

Any comment on that?

Thanks,
Weiguang

 --- Ted.Harding at nessie.mcc.ac.uk wrote: 
> On 25-Jan-05 Weiguang Shi wrote:
> > Hi,
> > 
> > Is there a Zipf-like distribution RNG in R?
> > 
> > Thanks,
> > Weiguang
> 
> "Zipf's Law" (as originally formulated in studies of
> the
> frequencies of words in texts) is to the effect that
> the
> relative frequencies with which words occur once,
> twice,
> 3 times, ... are in proportion to 1/1, 1/2, 1/3, ...
> ,
> 1/n, ... with no limit on n (i.e. the number of
> different
> words each represented n times is proportional to
> 1/n).
> 
> This is "improper" since sum(1/n) = infinity, so
> does not
> define a probability distribution. A respectable
> analogue
> is Fisher's logarithmic distribution p(x) where
> 
>   p(x) = ((t^x)/x)/(-log(1-t)), x = 1, 2, 3, ...
> 
> where t in (0,1) is the parameter of the
> distribution.
> 
> As t -> 1, p(x+1)/p(x) -> x/(x+1) as in Zipf's Law.
> 
> However, I've searched the R site and have found
> only
> one instance of a function directly related to this
> distribution, namely logff() in package VGAM, which
> is for estimating the parameter t.
> 
> So it looks as though there is no direct
> implementation
> of something like "rlogdist".
> 
> However, the logarithmic distribution is a limiting
> form of the negative binomial distribution
> (conditional
> on a positive value), and there are functions in R
> for
> random sampling from this distribution.
> 
> From my reading of "?rnbinom" in the base package,
> this
> function does not have the flexibitility you would
> need
> for this. But in the MASS package there is the
> function
> rnegbin() which does.
> 
> You would need to invoke
> 
>   rnegbin(N, mu=..., theta=... )
> 
> where the value ... of mu is large and the value ...
> of
> theta is small, and reject cases with value zero.
> This
> of course makes it awkward to generate a sample of
> given
> size.
> 
> Alternatively, you can try along the lines of
> 
> > x0<-1:10000; t<-0.999; p<-((t^x0)/x0)/(-log(1-t))
> > Y<-sample(x0,5000,replace=TRUE,prob=p)
> > hist(Y,breaks=100)
> 
> While this gives the logarithmic distribution over
> the
> range of x in x0, it is inexact in that it does not
> permit values greater than max(x0) to be sampled.
> 
> No doubt others can suggest something better than
> this!
> 
> Best wishes,
> Ted.
> 
> 
>
--------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
> Date: 25-Jan-05                                     
>  Time: 10:20:57
> ------------------------------ XFMail
> ------------------------------
>