[R] Efficient sampling from a discrete distribution in R

Tue Sep 4 07:24:39 CEST 2007

Thanks to you and Burwin for helping.  My simulation seems faster now
(btw, is there some easy way to time things in R?) but not as fast as
I was hoping.  Here's the top level function.  Any ideas on how to
make it faster?

# Sample from a Chinese Restaurant Process
#
# size: number of samples to draw
# a: alpha.  If large then there will tend to be more categories.
#
rchinese = function(size, a) {
  stopifnot(length(a) == 1)
  samples = c()
  new.category = 1
  for (i in 1:size) {
    denom = i - 1 + a
    p.new = a / denom
    ci =
      if (runif(1) < p.new) {
        new.category = new.category + 1;
        new.category - 1
      }
      else {
        counts = unname(table(factor(samples, levels = 1:(i-1))))
        pmf = counts / sum(counts)
        rdiscrete(1, pmf)
      }
    samples = c(samples, ci)
  }
  samples
}

On 9/3/07, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
> On Mon, 3 Sep 2007, Issac Trotts wrote:
>
> > Hello r-help,
> >
> > As far as I've seen, there is no function in R dedicated to sampling
> > from a discrete distribution with a specified mass function.  The
> > standard library doesn't come with anything called rdiscrete or rpmf,
> > and I can't find any such thing on the cheat sheet or in the
> > Probability Distributions chapter of _An Introduction to R_.  Googling
> > also didn't bring back anything.  So, here's my first attempt at a
> > solution.  I'm hoping someone here knows of a more efficient way.
>
> It's called sample().
>
> There are much more efficient algorithms than the one you used, and
> sample() sometimes uses one of them (Walker's alias method): see any good
> book on simulation (including my 'Stochastic Simulation, 1987).
>
> > # Sample from a discrete distribution with given probability mass function
> > rdiscrete = function(size, pmf) {
> >  stopifnot(length(pmf) > 1)
> >  cmf = cumsum(pmf)
> >  icmf = function(p) {
> >    min(which(p < cmf))
> >  }
> >  ps = runif(size)
> >  sapply(ps, icmf)
> > }
> >
> > test.rdiscrete = function(N = 10000) {
> >  err.tol = 6.0 / sqrt(N)
> >  xs = rdiscrete(N, c(0.5, 0.5))
> >  err = abs(sum(xs == 1) / N - 0.5)
> >  stopifnot(err < err.tol)
> >  list(e = err, xs = xs)
> > }
> >
> > Thanks,
> > Issac
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>