[R] Survey and Stratification
Thomas Lumley
tlumley at u.washington.edu
Thu May 26 22:04:32 CEST 2005
On Thu, 26 May 2005, Mark Hempelmann wrote:
> Dear WizaRds,
>
> Working through sampling theory, I tried to comprehend the concept of
> stratification and apply it with Survey to a small example. My question is
> more of theoretic nature, so I apologize if this does not fully fit this
> board's intention, but I have come to a complete stop in my efforts and need
> an expert to help me along. Please help:
>
> age<-matrix(c(rep(1,5), rep(2,3), 1:8, rep(3,5), rep(4,3), rep(5,5),
> rep(3,3), rep(15,5), rep(12,3), 23,25,27,21,22, 33,27,29), ncol=6, byrow=F)
> colnames(age)<-c("stratum", "id", "weight", "nh", "Nh", "y")
> age<-as.data.frame(age)
Ok. Assuming that Nh are the population sizes in each stratum, you have
5/15 sampled in stratum 1 and 3/12 in stratum 2.
This can be specified in a number of ways. You can use
   sampling weights of 15/5 and 12/3
   sampling probabilities of 5/15 and 3/12
with or without specifying the finite population correction. The
finite population correction can be specified as 15 and 12 or as 5/15 and
3/12, and if the finite population correction is specified the weights are
then optional.
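A minimal base-R sketch (not part of the original message, and not using the survey package) of how these quantities relate. It rebuilds the example data column by column; the names w and p are introduced here only for illustration.

```r
# Rebuild the example data from the question, one column at a time
age <- data.frame(
  stratum = c(rep(1, 5), rep(2, 3)),
  id      = 1:8,
  weight  = c(rep(3, 5), rep(4, 3)),
  nh      = c(rep(5, 5), rep(3, 3)),    # sample size in each stratum
  Nh      = c(rep(15, 5), rep(12, 3)),  # population size in each stratum
  y       = c(23, 25, 27, 21, 22, 33, 27, 29)
)

w <- age$Nh / age$nh   # sampling weights: 3 in stratum 1, 4 in stratum 2
p <- age$nh / age$Nh   # sampling probabilities: 1/3 and 1/4

all.equal(w, 1 / p)        # weights are the reciprocals of the probabilities
all.equal(w, age$weight)   # and match the "weight" column in the data
sum(w)                     # weights add up to the population size, 15 + 12 = 27
```

The weights and the probabilities carry the same information, which is why either one can be handed to svydesign().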
So
d1 <- svydesign(ids=~id, strata=~stratum, weights=~I(Nh/nh), data=age)
d2 <- svydesign(ids=~id, strata=~stratum, probs=~I(nh/Nh), data=age)
give the with-replacement design (agreeing with your age.des3), and
d3 <- svydesign(ids=~id, strata=~stratum, weights=~I(Nh/nh), fpc=~Nh, data=age)
d4 <- svydesign(ids=~id, strata=~stratum, probs=~I(nh/Nh), fpc=~Nh, data=age)
d5 <- svydesign(ids=~id, strata=~stratum, weights=~I(Nh/nh), fpc=~I(nh/Nh), data=age)
d6 <- svydesign(ids=~id, strata=~stratum, probs=~I(nh/Nh), fpc=~I(nh/Nh), data=age)
d7 <- svydesign(ids=~id, strata=~stratum, fpc=~Nh, data=age)
d8 <- svydesign(ids=~id, strata=~stratum, fpc=~I(nh/Nh), data=age)
all give the without-replacement design. We get
> svymean(~y,d1)
    mean     SE
y 26.296 0.9862
> svymean(~y,d2)
    mean     SE
y 26.296 0.9862
> svymean(~y,d3)
    mean     SE
y 26.296 0.8364
> svymean(~y,d4)
    mean     SE
y 26.296 0.8364
> svymean(~y,d5)
    mean     SE
y 26.296 0.8364
> svymean(~y,d6)
    mean     SE
y 26.296 0.8364
> svymean(~y,d7)
    mean     SE
y 26.296 0.8364
> svymean(~y,d8)
    mean     SE
y 26.296 0.8364
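For checking by hand, here is a rough base-R sketch of the textbook stratified-mean formulas. It is not how svymean() computes things internally, but on this data it appears to reproduce the numbers above. The variable names (est, v_wr, v_wor) are made up for the illustration.

```r
# Data from the question: y values and their stratum labels
y       <- c(23, 25, 27, 21, 22, 33, 27, 29)
stratum <- c(1, 1, 1, 1, 1, 2, 2, 2)
Nh      <- c(15, 12)                 # stratum population sizes
N       <- sum(Nh)                   # total population size, 27

nh   <- tapply(y, stratum, length)   # stratum sample sizes: 5 and 3
ybar <- tapply(y, stratum, mean)     # stratum sample means
s2   <- tapply(y, stratum, var)      # stratum sample variances

# Stratified mean: stratum means weighted by population share
est <- sum(Nh / N * ybar)

# Variance of the stratified mean, treating sampling as with replacement
v_wr <- sum((Nh / N)^2 * s2 / nh)

# Same, with the finite population correction (1 - nh/Nh) in each stratum
v_wor <- sum((Nh / N)^2 * s2 / nh * (1 - nh / Nh))

round(c(mean = est, SE_wr = sqrt(v_wr), SE_wor = sqrt(v_wor)), 4)
```

The only difference between the two standard errors is the factor (1 - nh/Nh), which is 2/3 in stratum 1 and 3/4 in stratum 2.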
Now, looking at your examples
> ## create survey design object
> age.des1<-svydesign(ids=~id, strata=~stratum, weight=~Nh, data=age)
> svymean(~y, age.des1)
> ## gives mean 25.568, SE 0.9257
This is wrong: the sampling weight is Nh/nh, not Nh.
> age.des2<-svydesign(ids=~id, strata=~stratum, weight=~I(nh/Nh), data=age)
> svymean(~y, age.des2)
> ## gives mean 25.483, SE 0.9227
This is wrong: the sampling weight is Nh/nh. You need prob=~I(nh/Nh) to
specify sampling fractions.
> age.des3<-svydesign(ids=~id, strata=~stratum, weight=~weight, data=age)
> svymean(~y, age.des3)
> ## gives mean 26.296, SE 0.9862
This is correct and agrees with d1 and d2.
> age.des4<-svydesign(ids=~id, strata=~stratum, data=age)
> svymean(~y, age.des4)
> ## gives mean 25.875, SE 0.9437
This is a stratified, unweighted mean, i.e. mean(age$y).
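The four point estimates can be checked without the survey package. This is only a sketch using weighted.mean() from base R on the same data; it reproduces the means but says nothing about the standard errors.

```r
# Data from the question
y  <- c(23, 25, 27, 21, 22, 33, 27, 29)
nh <- c(rep(5, 5), rep(3, 3))     # stratum sample sizes
Nh <- c(rep(15, 5), rep(12, 3))   # stratum population sizes

weighted.mean(y, Nh)        # weights = Nh     (age.des1): about 25.568
weighted.mean(y, nh / Nh)   # weights = nh/Nh  (age.des2): about 25.483
weighted.mean(y, Nh / nh)   # weights = Nh/nh  (age.des3): about 26.296
mean(y)                     # no weights       (age.des4): 25.875
```

Only Nh/nh gives each sampled unit the number of population units it stands for, so only that choice gives the stratified estimate.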
> age.des3 is the only estimator I am able to compute per hand correctly. It is
> stratified random sampling with inverse probability weighting with weight=
> nh/Nh ## sample size/ stratum size.
>
> Basically, I thought the option weight=~Nh as well as weight=~I(nh/Nh) would
> result in the same number, but it does not.
No, it does not. A weight of 3 is not the same as a weight of 1/3. With
the finite population correction it is safe to assume that numbers less
than 1 are sampling fractions and numbers greater than 1 are population
sizes, but this isn't safe when it comes to weights. It is possible that
someone could want to use sampling weights less than 1.
>
> I thought the Hansen-Hurwitz estimator per stratum offers the right numbers:
> p1=5/15, p2=3/12, so y1.total=1/5*(3*118), y2.total=1/3*(4*89) and the
> stratified estimator with this design should be: 1/27(y1.total+y2.total),
> obviously wrong.
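Since this gives a mean of about 7.02 for numbers around 25, it can't be right.
You have divided by the sample size twice: the weights 3 and 4 are already
Nh/nh, so 3*118 and 4*89 are the estimated stratum totals. You should have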
y1.total<-3*118
y2.total<-4*89
You then will get (y1.total+y2.total)/27 to be 26.29630, in agreement
with svymean().
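A small base-R sketch of the two calculations side by side, using the same data. The names wrong and right are just labels for this illustration.

```r
# Data from the question, split by stratum
y1 <- c(23, 25, 27, 21, 22)   # stratum 1 sample, sum 118
y2 <- c(33, 27, 29)           # stratum 2 sample, sum 89

# Calculation from the question: the extra 1/5 and 1/3 divide by
# the sample size a second time
wrong <- (1/5 * (3 * sum(y1)) + 1/3 * (4 * sum(y2))) / 27

# Corrected: weight times sample sum is already the estimated stratum total
y1.total <- 3 * sum(y1)   # 354
y2.total <- 4 * sum(y2)   # 356
right <- (y1.total + y2.total) / 27

c(wrong = wrong, right = right)   # roughly 7.0 and 26.3
```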
-thomas