[Rd] efficiency of sample() with prob.

Mon Aug 29 20:17:03 CEST 2005

Peter Windridge and I investigated this further.  It seems the 
distribution used for your tests is maximally favourable to your proposal 
(not uncommon in papers, but not very honest).  There are examples in 
which the existing method is substantially faster than Walker's (when each 
set of prob is used only a few times and there are many sets) and only a 
few areas in which there are large gains to be had.  For example, with 
binomial(N=10000, p=0.25) Walker's method is not twice as fast, and 10 
million samples only take a second or two.  (Tabulating them takes quite a 
bit longer.)  OTOH, with a distribution concentrated on 5 out of 10000 
points, the existing method is twice as fast.

Changing how this is done will break the reproducibility of past programs, 
and we don't really want to introduce yet more options.  So it seems only 
worth doing when there are substantial speed gains.  I am still tuning 
this, but it seems that it is worthwhile if the number of probs >= 0.1/N 
is more than 200 or so.  The advantage of Walker's method is that its 
speed is more-or-less independent of probs (unless they are actually all 
identical!), but equally that can be a disadvantage.

On Thu, 23 Jun 2005, Bo Peng wrote:

>> We suggest that you take up your own suggestion, research this area and
>> offer the R project the results of your research for consideration as your
>> contribution.
>
> I implemented Walker's alias method and re-compiled R. Here is what
> I did:

[...]

>
>> b=sample(seq(1,100), prob=seq(1,100), replace=TRUE, size=1000000)
>> table(b)/1000000*sum(seq(1,100))
>
> 4. run the following code in current R 2.1.0 and updated R.
>
> for(prob in seq(1,4)){
>  for(sample in seq(1,4)){
>    x = seq(1:(10^prob))   # short to long x
>    p = abs(rnorm(length(x)))  # prob vector
>    times = 10^(6-prob)   # run shorter cases more times
>    Rprof(paste("sample_", prob, "_", sample, ".prof", sep=''))
>    for(t in seq(1,times)){
>      sample(x, prob=p, size=10^sample, replace=TRUE )
>    }
>    Rprof(NULL)
>  }
> }
>
> Basically, I tried to test the performance of sample(replace=TRUE, prob=..)
> with different length of x and size.
>
> 5. process the profiles and here is the result.
> p: length of prob and x
> size: size of sample
> cell: execution time of old/updated sample()
>
>  size\p    10          10^2        10^3       10^4
>  10       2.4/1.6      0.32/0.22   0.20/0.08  0.24/0.06
>  10^2     3.1/2.6      0.48/0.28   0.28/0.06  0.30/0.06
>  10^3     11.8/11.1    1.84/1.14   0.94/0.18  0.96/0.08
>  10^4     96.8/96.6    15.34/9.68  7.54/1.06  7.48/0.16
>  run:     10000        1000        100        10 times
>
> We can see that the alias method is quicker than the linear search
> method in all cases. The performance difference is greatest (>50 times)
> when the original algorithm need to search in a long prob vector.
>
> I have not thoroughly tested the new function. I will do so if you
> (the developers) think that this has the potential to be incorporated
> into R.
>
> Thanks.
>
> Bo Peng
> Department of Statistics
> Rice University
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595