[R] Sampling

Thu Feb 7 18:34:08 CET 2008

On Wed, 6 Feb 2008, Tim Hesterberg wrote:

>> Tim Hesterberg wrote:
>>> I'll raise a related issue - sampling with unequal probabilities,
>>> without replacement.  R does the wrong thing, in my opinion:
>>> ...
>> Peter Dalgaard wrote:
>> But is that the right thing? ...
> (See bottom for more of the previous messages.)
>
>
> First, consider the common case, where size * max(prob) < 1 --
> sampling with unequal probabilities without replacement.
>
> Why do people do sampling with unequal probabilities, without
> replacement?  A typical application would be sampling with probability
> proportional to size, or more generally where the desire is that
> selection probabilities match some criterion.

In real survey PPS sampling it also matters what the pairwise joint 
selection probabilities are -- and there are *many* algorithms, with 
different properties. Yves Tillé has written an R package that implements 
some of them, and the pps package implements others.

> The default S-PLUS algorithm does that.  The selection probabilities
> at each of step 1, 2, ..., size are all equal to prob, and the overall
> probabilities of selection are size*prob.

Umm, no, they aren't.

S-PLUS 7.0.3 doesn't say explicitly what its algorithm is, but is happy to 
take a sample of size 10 from a population of size 10 with unequal 
sampling probabilities.  The overall selection probability *can't* be 
anything other than 1 for each element -- sampling without replacement and 
proportional to any other set of probabilities is impossible.

Even in a milder case -- samples of size 5 from 1:10 with probabilities 
proportional to 1:10 -- the deviation is noticeable in 1000 replications. 
In this case sampling with the specified probabilities is actually 
possible, but S-PLUS doesn't do it.
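
One way to see that the specified probabilities are attainable in this case is systematic PPS sampling. The sketch below is a swapped-in illustration, not what S-PLUS or R's sample() does, and `systematic_pps` is a hypothetical helper name. It lays the target inclusion probabilities size*prob end to end on [0, size], picks one uniform start, and steps through in unit increments.

```r
# Systematic PPS sampling (illustration only; not the S-PLUS or R algorithm).
# 'systematic_pps' and its arguments are hypothetical names for this sketch.
systematic_pps <- function(prob, size, u = runif(1)) {
  incl <- size * prob / sum(prob)        # target inclusion probabilities
  stopifnot(all(incl <= 1 + 1e-12))      # otherwise the target is unattainable
  upper <- cumsum(incl)                  # right end of each unit's interval
  points <- u + 0:(size - 1)             # one random start, then unit steps
  # unit i is selected when a point falls in (upper[i - 1], upper[i]]
  idx <- findInterval(points, upper, left.open = TRUE) + 1
  pmin(idx, length(prob))                # guard against rounding at the top end
}

# Samples of size 5 from 1:10 with probabilities proportional to 1:10
systematic_pps(1:10, 5)
```

Each unit's interval has length size*prob[i], which is at most 1 here, so a unit is hit at most once and is included with exactly that probability. With a fixed ordering of the units, though, some pairs can never appear together (units 1 and 2 in this example), which is the kind of joint-probability property that distinguishes the various PPS algorithms.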

Now, it might be useful to add another replace=FALSE sampler to sample(), 
such as the newish Conditional Poisson Sampler based on the work of 
S.X.Chen.  This does give correct marginal probabilities of inclusion, and 
the pairwise joint probabilities are not too hard to compute.

I don't think that dropping the current sequential PPS implementation is 
a good idea. The help page does explain the algorithm, though it might be 
useful to add an explicit note that the marginal probabilities of sampling 
are not the supplied probabilities.
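
To make that point concrete, here is a minimal sketch that computes the exact marginal inclusion probabilities under draw-by-draw sampling, assuming the scheme the help page describes: at each draw the next item is chosen with probability proportional to its weight among the remaining items. The function name `seq_inclusion` is hypothetical, and the enumeration over subsets is only practical for small populations.

```r
# Exact inclusion probabilities for sequential sampling without replacement.
# Assumption: each draw picks a remaining unit with probability proportional
# to prob. 'seq_inclusion' is a hypothetical helper, not part of R.
seq_inclusion <- function(prob, size) {
  N <- length(prob)
  prob <- prob / sum(prob)
  masks <- 0:(2^N - 1)                   # every subset, encoded as a bitmask
  # member[s, j] is TRUE when unit j belongs to the subset with index s
  member <- sapply(seq_len(N), function(j) (masks %/% 2^(j - 1)) %% 2 == 1)
  set_size <- rowSums(member)
  state <- numeric(2^N)                  # state[s] = P(units drawn so far = subset s)
  state[1] <- 1                          # start from the empty set
  for (k in seq_len(size)) {
    new_state <- numeric(2^N)
    for (s in which(set_size == k - 1)) {
      if (state[s] == 0) next
      out <- which(!member[s, ])         # units not yet drawn
      remaining <- sum(prob[out])
      for (j in out) {
        t <- s + 2^(j - 1)               # index of the subset with unit j added
        new_state[t] <- new_state[t] + state[s] * prob[j] / remaining
      }
    }
    state <- new_state
  }
  colSums(state * member)                # P(unit j is in the final sample)
}

# Compare with the nominal targets size * prob for size 5 from 1:10
cbind(sequential = seq_inclusion(1:10, 5), nominal = 5 * (1:10) / 55)
```

Under this scheme the heavily weighted units come out less often than size*prob and the lightly weighted ones more often, so the two columns disagree even though the nominal targets are all below 1.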

 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle