[R] Survey - Cluster Sampling
Thomas Lumley
tlumley at u.washington.edu
Thu Jun 16 17:01:08 CEST 2005
On Thu, 16 Jun 2005, Mark Hempelmann wrote:
> Dear WizaRds,
>
> I am struggling to compute correctly a cluster sampling design. I want
> to do one stage clustering with different parametric changes:
>
> Let M be the total number of clusters in the population, and m the
> number sampled. Let N be the total of elements in the population and n
> the number sampled. y are the values sampled. This is my example data:
>
> clus1 <- data.frame(cluster=c(1,1,1,2,2,2,3,3,3), id=seq(1:3,3),
> weight=rep(72/9,9), nl=rep(3,9), Nl=rep(3,9), N=rep(72,9), y=c(23,33,77,
> 25,35,74, 27,37,72) )
>
> 1. Let M=m=3 and N=n=9. Then:
>
> dclus1<-svydesign(id=~cluster, data=clus1)
> svymean(~y, dclus1)
>
> mean SE
> y 44.778 0.294, the unweighted mean, assuming equal probability in the
> clusters. ok.
Yes.
> 2. Let M=23, m=3 and N=72, n=9, then I am unable to use svydesign correctly:
>
> dclus2<-svydesign(id=~cluster, data=clus1, fpc=~N)
> svymean(~y, dclus2)
>
> mean SE
> y 44.778 0.2878, but it should be 23/72 * 1/3(133+134+136)=42.91, since
> I have to include the total number of clusters/total population M/N into
> the estimator. How can I include the information of the total number of
> clusters?
The fpc term should be the total number of clusters, so 23 rather than 72.
clus1$M<-rep(23,9)
dclus2a<-svydesign(id=~cluster, data=clus1, fpc=~M)
svymean(~y, dclus2a)
Now, this still gives 44.778, because each observation still has the same
weight. It describes a one-stage cluster sampling design where each
cluster has only three elements. This is an equal-probability sampling
design. Any equal-probability sampling design will give the same estimated
mean.
If your design was to take a simple random sample of three clusters and
then take all the elements in each cluster then dclus2a is giving the
correct mean (well, the one I wanted it to give). Estimates of the
population total will be different, but not the mean.
Your expected estimate of the mean is also a reasonable one. In survey
statistics there is often more than one reasonable estimator even for
something as simple as the mean. My estimator is
sum(weights*y)/sum(weights), which has some practical advantages: it is
easy to generalise to more complex designs (including things like
post-stratification), it can be computed without knowing the sampling
design (which is important when using replicate weights to compute
variances), it is the definition of the mean that agrees with linear
regression models, and it is what Stata uses, making it easier to compare
results.
Your estimator uses the expected value of the denominator rather than the
observed value. This probably implies that your estimator is
design-unbiased and mine isn't. Since there aren't design-unbiased
estimators for most statistics more complicated than the mean I don't
worry so much about it.
You might also have had a sampling design where you took a simple random
sample of three clusters and then up to three elements from each cluster.
dclus2b<-svydesign(id=~cluster+id, fpc=~M+nl, data=clus1)
This gives the same mean as dclus2a, because in fact you sampled 100% of
each sampled cluster.
> 3. How do I work with weights correctly? I understand that weights imply
> inverse probability weighting 1/p with p=n/N in simple sampling, in
> our case 72/9=8, because I sample 9 units out of a total population of
> 72. Again, I couldn't tell survey the number of total clusters M. So:
>
> dclus3<-svydesign(id=~cluster, weights=~weight, data=clus1, fpc=~N)
> svymean(~y, dclus3)
>
> mean SE
> y 44.778 0.2878, still exactly the same numbers, although I provided the
> weights. What am I doing wrong?
Again, fpc should be M rather than N. The help page says that the relevant
population size is in "sampling units" (ie, clusters). It used to say PSUs
before the package was extended to handle multistage fpcs, which was
probably clearer but now wouldn't be true.
Apart from that you aren't doing anything wrong. The mean should still be
the same as the unweighted mean because you are giving each observation
the same weight. And it is.
The total won't be the same as dclus2a and dclus2b, because you are now
telling R the population size in elements as well as in PSUs.
-thomas
More information about the R-help
mailing list