[R] Survey and Stratification

Thomas Lumley tlumley at u.washington.edu
Thu May 26 22:04:32 CEST 2005

On Thu, 26 May 2005, Mark Hempelmann wrote:

> Dear WizaRds,
> 	Working through sampling theory, I tried to comprehend the concept of 
> stratification and apply it with Survey to a small example. My question is 
> more of theoretic nature, so I apologize if this does not fully fit this 
> board's intention, but I have come to a complete stop in my efforts and need 
> an expert to help me along. Please help:
> age<-matrix(c(rep(1,5), rep(2,3), 1:8, rep(3,5), rep(4,3), rep(5,5), 
> rep(3,3), rep(15,5), rep(12,3), 23,25,27,21,22, 33,27,29), ncol=6, byrow=F)
> colnames(age)<-c("stratum", "id", "weight", "nh", "Nh", "y")
> age<-as.data.frame(age)

Ok.  Assuming that Nh are the population sizes in each stratum, you have 
5/15 sampled in stratum 1 and 3/12 in stratum 2.
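
As a minimal sketch in base R (the data frame name `design` is only for illustration and is not part of your code), the sampling fractions and sampling weights implied by nh and Nh are:

```r
# Per-stratum sample sizes (nh) and assumed population sizes (Nh)
design <- data.frame(stratum = c(1, 2), nh = c(5, 3), Nh = c(15, 12))

# Sampling fraction: probability that a unit in the stratum is sampled
design$prob <- design$nh / design$Nh

# Sampling weight: number of population units each sampled unit represents
design$weight <- design$Nh / design$nh

print(design)

# The weights summed over the sample recover the population size, 15 + 12 = 27
sum(design$weight * design$nh)
```

The weights come out as 3 and 4, which is what the "weight" column of your data frame already contains.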

This can be specified in a number of ways. You can use
   sampling weights of 15/5 and 12/3
   sampling probabilities of 5/15 and 3/12
with or without specifying the finite population correction. The 
finite population correction can be specified as population sizes (15 and 
12) or as sampling fractions (5/15 and 3/12), and if the finite population 
correction is specified the weights are then optional.

   d1<-svydesign(ids=~id, strata=~stratum, weight=~I(Nh/nh), data=age)
   d2<-svydesign(ids=~id, strata=~stratum, prob=~I(nh/Nh), data=age)
give the with-replacement design (agreeing with your age.des3) and
   d3<-svydesign(ids=~id, strata=~stratum, weight=~I(Nh/nh), fpc=~Nh,data=age)
   d4<-svydesign(ids=~id, strata=~stratum, prob=~I(nh/Nh), fpc=~Nh,data=age)
   d5<-svydesign(ids=~id, strata=~stratum, weight=~I(Nh/nh), fpc=~I(nh/Nh),data=age)
   d6<-svydesign(ids=~id, strata=~stratum, prob=~I(nh/Nh), fpc=~I(nh/Nh),data=age)
   d7<-svydesign(ids=~id, strata=~stratum, fpc=~Nh,data=age)
   d8<-svydesign(ids=~id, strata=~stratum, fpc=~I(nh/Nh),data=age)
all give the without-replacement design. We get
> svymean(~y,d1)
     mean     SE
y 26.296 0.9862
> svymean(~y,d2)
     mean     SE
y 26.296 0.9862
> svymean(~y,d3)
     mean     SE
y 26.296 0.8364
> svymean(~y,d4)
     mean     SE
y 26.296 0.8364
> svymean(~y,d5)
     mean     SE
y 26.296 0.8364
> svymean(~y,d6)
     mean     SE
y 26.296 0.8364
> svymean(~y,d7)
     mean     SE
y 26.296 0.8364
> svymean(~y,d8)
     mean     SE
y 26.296 0.8364
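
The two standard errors can be checked by hand. This is a sketch in base R using the textbook stratified-mean variance formulas, which I am assuming correspond to what svymean() computes for this simple design; the variable names are illustrative only.

```r
# Observations by stratum, and the assumed population sizes
y  <- list(c(23, 25, 27, 21, 22), c(33, 27, 29))
Nh <- c(15, 12)
nh <- sapply(y, length)
N  <- sum(Nh)

ybar <- sapply(y, mean)  # stratum means
s2   <- sapply(y, var)   # stratum sample variances

# Stratified estimate of the population mean
est <- sum(Nh * ybar) / N

# With replacement: no finite population correction
var_wr <- sum((Nh / N)^2 * s2 / nh)

# Without replacement: each stratum term is shrunk by (1 - nh/Nh)
var_wor <- sum((Nh / N)^2 * (1 - nh / Nh) * s2 / nh)

round(c(mean = est, se_wr = sqrt(var_wr), se_wor = sqrt(var_wor)), 4)
# mean 26.2963, se_wr 0.9862, se_wor 0.8364
```

The only difference between the two designs is the factor (1 - nh/Nh), which is why the point estimate is unchanged and only the SE drops.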

Now, looking at your examples
> ## create survey design object
> age.des1<-svydesign(ids=~id, strata=~stratum, weight=~Nh, data=age)
> svymean(~y, age.des1)
> ## gives mean 25.568, SE 0.9257

This is wrong: the sampling weight is Nh/nh, not Nh.

> age.des2<-svydesign(ids=~id, strata=~stratum, weight=~I(nh/Nh), data=age)
> svymean(~y, age.des2)
> ## gives mean 25.483, SE 0.9227

This is wrong: the sampling weight is Nh/nh. You need prob=~I(nh/Nh) to 
specify sampling fractions.

> age.des3<-svydesign(ids=~id, strata=~stratum, weight=~weight, data=age)
> svymean(~y, age.des3)
> ## gives mean 26.296, SE 0.9862

This is correct and agrees with d1 and d2

> age.des4<-svydesign(ids=~id, strata=~stratum, data=age)
> svymean(~y, age.des4)
> ## gives mean 25.875, SE 0.9437

This is a stratified, unweighted mean, i.e. mean(age$y).
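
The four point estimates can be reproduced without the survey package, since each is just a weighted mean of y under the weights that were supplied. A sketch in base R, with the vectors rebuilt from your data and names chosen for illustration:

```r
y  <- c(23, 25, 27, 21, 22, 33, 27, 29)
nh <- rep(c(5, 3),   times = c(5, 3))  # stratum sample size for each row
Nh <- rep(c(15, 12), times = c(5, 3))  # stratum population size for each row

est <- c(
  des1 = weighted.mean(y, Nh),       # weight = Nh: wrong
  des2 = weighted.mean(y, nh / Nh),  # weight = sampling fraction: wrong
  des3 = weighted.mean(y, Nh / nh),  # weight = Nh/nh: correct
  des4 = mean(y)                     # no weights at all
)
round(est, 3)
# des1 25.568, des2 25.483, des3 26.296, des4 25.875
```

Only the standard errors need the design information; the means depend on the weights alone.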

> age.des3 is the only estimator I am able to compute per hand correctly. It is 
> stratified random sampling with inverse probablility weighting with weight= 
> nh/Nh ## sample size/ stratum size.
> Basically, I thought the option weight=~Nh as well as weight=~I(nh/Nh) would 
> result in the same number, but it does not.

No, it does not.  A weight of 3 is not the same as a weight of 1/3.  With 
the finite population correction it is safe to assume that numbers less 
than 1 are sampling fractions and numbers greater than 1 are population 
sizes, but this isn't safe when it comes to weights.  It is possible that 
someone could want to use sampling weights less than 1.
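
One illustration of weights below 1 that are still weights, sketched in base R: rescaling all the weights by a constant leaves the estimated mean unchanged, whereas inverting them does not. The rescaling by sum(w) is my own example and not something from your code.

```r
y <- c(23, 25, 27, 21, 22, 33, 27, 29)
w <- rep(c(15 / 5, 12 / 3), times = c(5, 3))  # sampling weights 3 and 4

m_raw      <- weighted.mean(y, w)
m_rescaled <- weighted.mean(y, w / sum(w))  # every weight is now below 1
m_inverted <- weighted.mean(y, 1 / w)       # sampling fractions used as weights

round(c(raw = m_raw, rescaled = m_rescaled, inverted = m_inverted), 3)
# raw 26.296, rescaled 26.296, inverted 25.483
```

So a number less than 1 cannot be silently reinterpreted as a sampling fraction when it is passed as a weight. Rescaled weights would change an estimated total, so this applies to the mean only.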

> I thought the Hansen-Hurwitz estimator per stratum offers the right numbers:
> p1=5/15, p2=3/12, so y1.total=1/5*(3*118), y2.total=1/3*(4*89) and the 
> stratified estimator with this design should be: 1/27(y1.total+y2.total), 
> obviously wrong.

Since this gives a mean of 7.01 for numbers around 25 it can't be right. 
You have divided by sample size twice: the weights 3 = 15/5 and 4 = 12/3 
already include the division by nh. You should have
   y1.total=3*118, y2.total=4*89
You will then get (y1.total+y2.total)/27 to be 26.29630, in agreement 
with svymean().
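
The arithmetic, sketched in base R. The Hansen-Hurwitz line assumes the per-draw selection probability under simple random sampling with replacement is 1/Nh, which is the standard form of that estimator; variable names are illustrative.

```r
sum1 <- sum(c(23, 25, 27, 21, 22))  # 118, stratum 1 sample total
sum2 <- sum(c(33, 27, 29))          # 89,  stratum 2 sample total
N    <- 15 + 12

# Your version: the weights 3 and 4 already divide by nh, so the extra
# 1/5 and 1/3 divide by the sample size a second time
wrong <- (1/5 * 3 * sum1 + 1/3 * 4 * sum2) / N

# Weighted totals, each sample total scaled up by Nh/nh
right <- (3 * sum1 + 4 * sum2) / N

# Hansen-Hurwitz form: (1/nh) * sum(y_i / p_i) with per-draw p_i = 1/Nh
hh <- (mean(c(23, 25, 27, 21, 22) / (1/15)) + mean(c(33, 27, 29) / (1/12))) / N

round(c(wrong = wrong, right = right, hh = hh), 4)
# wrong 7.0173, right 26.2963, hh 26.2963
```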

