[R] Zipf random number generation
Weiguang Shi
wgshi2001 at yahoo.ca
Tue Jan 25 22:53:47 CET 2005
Thanks very much Ted for the detailed explanation.
It might be flawed but some "common practices" in my
area use the following approach to generate a random
rank for some Zipf-like distribution with the
parameter alpha.
The key is to introduce the limit of ranks, N. With N
and alpha, therefore, the probability of rank i is
determined by
c/pow((double) i, alpha)
where
c = sum(1.0 / pow((double) i, alpha))
Any comment on that?
Thanks,
Weiguang
--- Ted.Harding at nessie.mcc.ac.uk wrote:
> On 25-Jan-05 Weiguang Shi wrote:
> > Hi,
> >
> > Is there a Zipf-like distribution RNG in R?
> >
> > Thanks,
> > Weiguang
>
> "Zipf's Law" (as originally formulated in studies of
> the
> frequencies of words in texts) is to the effect that
> the
> relative frequencies with which words occur once,
> twice,
> 3 times, ... are in proportion to 1/1, 1/2, 1/3, ...
> ,
> 1/n, ... with no limit on n (i.e. the number of
> different
> words each represented n times is proportional to
> 1/n).
>
> This is "improper" since sum(1/n) = infinity, so
> does not
> define a probability distribution. A respectable
> analogue
> is Fisher's logarithmic distribution p(x) where
>
> p(x) = ((t^x)/x)/(-log(1-t)), x = 1, 2, 3, ...
>
> where t in (0,1) is the parameter of the
> distribution.
>
> As t -> 1, p(x+1)/p(x) -> x/(x+1) as in Zipf's Law.
>
> However, I've searched the R site and have found
> only
> one instance of a function directly related to this
> distribution, namely logff() in package VGAM, which
> is for estimating the parameter t.
>
> So it looks as though there is no direct
> implementation
> of something like "rlogdist".
>
> However, the logarithmic distribution is a limiting
> form of the negative binomial distribution
> (conditional
> on a positive value), and there are functions in R
> for
> random sampling from this distribution.
>
> From my reading of "?rnbinom" in the base package,
> this
> function does not have the flexibitility you would
> need
> for this. But in the MASS package there is the
> function
> rnegbin() which does.
>
> You would need to invoke
>
> rnegbin(N, mu=..., theta=... )
>
> where the value ... of mu is large and the value ...
> of
> theta is small, and reject cases with value zero.
> This
> of course makes it awkward to generate a sample of
> given
> size.
>
> Alternatively, you can try along the lines of
>
> > x0<-1:10000; t<-0.999; p<-((t^x0)/x0)/(-log(1-t))
> > Y<-sample(x0,5000,replace=TRUE,prob=p)
> > hist(Y,breaks=100)
>
> While this gives the logarithmic distribution over
> the
> range of x in x0, it is inexact in that it does not
> permit values greater than max(x0) to be sampled.
>
> No doubt others can suggest something better than
> this!
>
> Best wishes,
> Ted.
>
>
>
--------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
> Date: 25-Jan-05
> Time: 10:20:57
> ------------------------------ XFMail
> ------------------------------
>
More information about the R-help
mailing list