[Rd] efficiency of sample() with prob.
Bo Peng
ben.bob at gmail.com
Thu Jun 23 23:37:35 CEST 2005
> We suggest that you take up your own suggestion, research this area and
> offer the R project the results of your research for consideration as your
> contribution.
I implemented Walker's alias method and re-compiled R. Here is what
I did:
1. replace function ProcSampleReplace in R-2.1.0/src/main/random.c
with the following one
static void ProbSampleReplace(int n, double *p, int *perm, int nans, int *ans)
{
/* allocate memory for a, p and HL */
double * q = Calloc(n, double);
int * a = Calloc(n, int);
int * HL = Calloc(n, int);
int * H = HL;
int * L = HL+n-1;
int i, j, k;
double rU; /* U[0,1)*n */
/* set up alias table */
/* initialize q with n*p0,...n*p_n-1 */
for(i=0; i<n; ++i)
q[i] = p[i]*n;
/* initialize a with indices */
for(i=0; i<n; ++i)
a[i] = i;
/* set up H and L */
for(i=0; i<n; ++i) {
if( q[i] >= 1.)
*H++ = i;
else
*L-- = i;
}
while( H != HL && L != HL+n-1) {
j = *(L+1);
k = *(H-1);
a[j] = k;
q[k] += q[j] - 1;
L++; /* remove j from L */
if( q[k] < 1. ) {
*L-- = k; /* add k to L */
--H; /* remove k */
}
}
/* generate sample */
for (i = 0; i < nans; ++i) {
rU = unif_rand() * n;
k = (int)(rU);
rU -= k; /* rU becomes rU-[rU] */
if( rU < q[k] )
ans[i] = k+1;
else
ans[i] = a[k]+1;
}
Free(HL);
Free(a);
Free(q);
}
2. make and make install
3. test the new sample function by code like
> b=sample(seq(1,100), prob=seq(1,100), replace=TRUE, size=1000000)
> table(b)/1000000*sum(seq(1,100))
4. run the following code in current R 2.1.0 and updated R.
for(prob in seq(1,4)){
for(sample in seq(1,4)){
x = seq(1:(10^prob)) # short to long x
p = abs(rnorm(length(x))) # prob vector
times = 10^(6-prob) # run shorter cases more times
Rprof(paste("sample_", prob, "_", sample, ".prof", sep=''))
for(t in seq(1,times)){
sample(x, prob=p, size=10^sample, replace=TRUE )
}
Rprof(NULL)
}
}
Basically, I tried to test the performance of sample(replace=TRUE, prob=..)
with different length of x and size.
5. process the profiles and here is the result.
p: length of prob and x
size: size of sample
cell: execution time of old/updated sample()
size\p 10 10^2 10^3 10^4
10 2.4/1.6 0.32/0.22 0.20/0.08 0.24/0.06
10^2 3.1/2.6 0.48/0.28 0.28/0.06 0.30/0.06
10^3 11.8/11.1 1.84/1.14 0.94/0.18 0.96/0.08
10^4 96.8/96.6 15.34/9.68 7.54/1.06 7.48/0.16
run: 10000 1000 100 10 times
We can see that the alias method is quicker than the linear search
method in all cases. The performance difference is greatest (>50 times)
when the original algorithm need to search in a long prob vector.
I have not thoroughly tested the new function. I will do so if you
(the developers) think that this has the potential to be incorporated
into R.
Thanks.
Bo Peng
Department of Statistics
Rice University
More information about the R-devel
mailing list