[R] Survey - Cluster Sampling

Thu Jun 16 17:01:08 CEST 2005

On Thu, 16 Jun 2005, Mark Hempelmann wrote:

> Dear WizaRds,
>
> I am struggling to compute a cluster sampling design correctly. I want
> to do one-stage clustering with different parameter settings:
>
> Let M be the total number of clusters in the population, and m the
> number sampled. Let N be the total number of elements in the population
> and n the number sampled. y are the values sampled. This is my example data:
>
> clus1 <- data.frame(cluster=c(1,1,1,2,2,2,3,3,3), id=rep(1:3,3),
> weight=rep(72/9,9), nl=rep(3,9), Nl=rep(3,9), N=rep(72,9), y=c(23,33,77,
> 25,35,74, 27,37,72) )
>
> 1. Let M=m=3 and N=n=9. Then:
>
> dclus1<-svydesign(id=~cluster,  data=clus1)
> svymean(~y, dclus1)
>
>     mean    SE
> y 44.778 0.294
>
> This is the unweighted mean, assuming equal probability in the
> clusters. OK.

Yes.

> 2. Let M=23, m=3 and N=72, n=9, then I am unable to use svydesign correctly:
>
> dclus2<-svydesign(id=~cluster,  data=clus1, fpc=~N)
> svymean(~y, dclus2)
>
>     mean     SE
> y 44.778 0.2878
>
> But it should be 23/72 * 1/3 * (133+134+136) = 42.91, since I have to
> include the ratio of total clusters to total population, M/N, in the
> estimator. How can I include the information about the total number of
> clusters?

The fpc term should be the total number of clusters, so 23 rather than 72.
   clus1$M<-rep(23,9)
   dclus2a<-svydesign(id=~cluster, data=clus1, fpc=~M)
   svymean(~y, dclus2a)

Now, this still gives 44.778, because each observation still has the same 
weight.  It describes a one-stage cluster sampling design where each 
cluster has only three elements.  This is an equal-probability sampling 
design. Any equal-probability sampling design will give the same estimated 
mean.
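
A quick hand check of this, as a minimal sketch in base R rather than output from the survey package: with the nine y values from the example, any constant weight cancels out of sum(w*y)/sum(w). The variable names below are made up for illustration.

```r
# y values from the example data in the question
y <- c(23, 33, 77, 25, 35, 74, 27, 37, 72)

# Two equal-probability weightings:
#   23/3 per element (M = 23 clusters, m = 3 sampled)
#   72/9 per element (N = 72 elements, n = 9 sampled)
w_clusters <- rep(23 / 3, length(y))
w_elements <- rep(72 / 9, length(y))

mean_unweighted <- mean(y)
mean_w_clusters <- sum(w_clusters * y) / sum(w_clusters)
mean_w_elements <- sum(w_elements * y) / sum(w_elements)

# All three values agree, because a constant weight cancels in the ratio
print(c(mean_unweighted, mean_w_clusters, mean_w_elements))
```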

If your design was to take a simple random sample of three clusters and 
then take all the elements in each cluster then dclus2a is giving the 
correct mean (well, the one I wanted it to give). Estimates of the 
population total will be different, but not the mean.

The estimate of the mean that you expected is also a reasonable one. In survey 
statistics there is often more than one reasonable estimator even for 
something as simple as the mean.  My estimator is 
sum(weights*y)/sum(weights), which has some practical advantages: it is 
easy to generalise to more complex designs (including things like 
post-stratification), it can be computed without knowing the sampling 
design (which is important when using replicate weights to compute 
variances), it is the definition of the mean that agrees with linear 
regression models, and it is what Stata uses, making it easier to compare 
results.

Your estimator uses the expected value of the denominator rather than the 
observed value. This probably implies that your estimator is 
design-unbiased and mine isn't.  Since there aren't design-unbiased 
estimators for most statistics more complicated than the mean I don't 
worry so much about it.
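
The difference between the two estimators can be worked through by hand. This is a sketch in base R, not survey package output, assuming the weight M/m = 23/3 for every sampled element; the variable names are hypothetical. Both estimators share the numerator sum(w*y); one divides by the observed sum(w), the other by the known N.

```r
y       <- c(23, 33, 77, 25, 35, 74, 27, 37, 72)
cluster <- rep(1:3, each = 3)
M <- 23; m <- 3; N <- 72

# Each sampled cluster stands for M/m clusters, so every element gets M/m
w <- rep(M / m, length(y))

total_hat <- sum(w * y)   # estimated population total
N_hat     <- sum(w)       # estimated number of elements (observed denominator)

mean_ratio   <- total_hat / N_hat   # sum(w*y)/sum(w)
mean_known_N <- total_hat / N       # same numerator, known N as denominator

# The formula from the question: M/N times the mean of the cluster totals
cluster_totals <- tapply(y, cluster, sum)
mean_question  <- (M / N) * mean(cluster_totals)

print(round(c(mean_ratio, mean_known_N, mean_question), 3))
```

The estimated population size sum(w) comes out as 69 rather than 72, which is why the two means differ (about 44.78 against about 42.91).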

You might also have had a sampling design where you took a simple random 
sample of three clusters and then up to three elements from each cluster.
   dclus2b<-svydesign(id=~cluster+id, fpc=~M+nl, data=clus1)
This gives the same mean as dclus2a, because in fact you sampled 100% of 
each sampled cluster.

> 3. How do I work with weights correctly? I understand that weights imply
>  inverse probability weighting 1/p with p=n/N in simple sampling, in
> our case 72/9=8, because I sample 9 units out of a total population of
> 72. Again, I couldn't tell survey the number of total clusters M. So:
>
> dclus3<-svydesign(id=~cluster,  weights=~weight, data=clus1, fpc=~N)
> svymean(~y, dclus3)
>
>     mean     SE
> y 44.778 0.2878
>
> Still exactly the same numbers, although I provided the weights. What am
> I doing wrong?

Again, fpc should be M rather than N. The help page says that the relevant 
population size is in "sampling units" (i.e., clusters). It used to say PSUs 
before the package was extended to handle multistage fpcs; that was 
probably clearer, but it would no longer be true.

Apart from that you aren't doing anything wrong. The mean should still be 
the same as the unweighted mean because you are giving each observation 
the same weight. And it is.

The total won't be the same as dclus2a and dclus2b, because you are now 
telling R the population size in elements as well as in PSUs.
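
To see where the totals part ways, here is a hand calculation in base R. It is a sketch, not svytotal output, and it assumes that explicitly supplied weights are what enter the point estimate, while weights of M/m = 23/3 apply when only the cluster fpc is given. Variable names are illustrative.

```r
y <- c(23, 33, 77, 25, 35, 74, 27, 37, 72)

# Weights implied by the cluster fpc alone: M/m = 23/3 per element
w_fpc <- rep(23 / 3, length(y))
# Explicit weights from the question: N/n = 72/9 = 8 per element
w_explicit <- rep(72 / 9, length(y))

# Estimated totals under each weighting
total_fpc      <- sum(w_fpc * y)
total_explicit <- sum(w_explicit * y)

# Population size in elements implied by each weighting
size_fpc      <- sum(w_fpc)
size_explicit <- sum(w_explicit)

print(c(total_fpc = total_fpc, total_explicit = total_explicit))
print(c(size_fpc = size_fpc, size_explicit = size_explicit))
```

The means agree because each weighting is constant, but the totals scale with the implied population size (69 elements against 72).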

 	-thomas