[R] Post Stratification
Thomas Lumley
tlumley at u.washington.edu
Mon Jun 19 23:17:49 CEST 2006
On Sun, 18 Jun 2006, Mark Hempelmann wrote:
> Dear WizaRds,
>
> having met some of you in person in Vienna, I think even more fondly
> of this community and hope to continue on this route. It was great
> talking with you and learning from you. Thank you. I am trying to work
> through an artificial example in post stratification. This is my dataset:
>
> library(survey)
> age <- data.frame(id=1:8, stratum=rep( c("S1","S2"),c(5,3)),
> weight=rep(c(3,4),c(5,3)), nh=rep(c(5,3),c(5,3)),
> Nh=rep(c(15,12),c(5,3)), y=c(23,25,27,21,22, 77,72,74) )
>
> pop.types <- table(stratum=age$stratum)
No, you need population totals here, that's the whole point of
post-stratification.
> age.post <- svydesign(ids=~1, strata=NULL, data=age, fpc=~Nh)
Since the sampling isn't stratified you can't give stratum-specific fpc.
This should give an error message. I will add one.
> post <- postStratify(design=age.post, strata=~stratum, population=pop.types)
Yes, but pop.types is defined wrongly
>
> pop.types <- data.frame(stratum = c("S1","S2"), Freq = c(15, 12))
This is the correct definition
> However, I compute:
> Nh=c(15,12); nh=c(5,3); sh=by(age$y, age$stratum, var); N=sum(Nh)
> # Mean estimator
> y.bar=by(age$y, age$stratum, mean) ## 23.6; 74.33
> estimator=1/N*sum(Nh*y.bar) ## 46.14815
> # Variance estimator
> vari=1/N^2*sum(Nh*(Nh-nh)*sh/nh)
> sqrt(vari) ## .7425903
>
> and with Taylor expansion .7750118
>
This is partly the problem with giving stratum-specific fpc and partly
because "survey" is using a less elegant but more general estimator (as is
often the case). The estimator is the one due to Valliant, described in
the reference on the help page.
The primary difference from the estimator you given is that R does not try
to compute finite population corrections for each post-stratum. Another
more subtle difference is that R's formula is estimating the conditional
variance, conditional on the estimated weights, rather than the
unconditional variance.
So, to do the computations correctly
> library(survey)
> age <- data.frame(id=1:8, stratum=rep( c("S1","S2"),c(5,3)),
+ weight=rep(c(3,4),c(5,3)), nh=rep(c(5,3),c(5,3)),
+ Nh=rep(c(15,12),c(5,3)), y=c(23,25,27,21,22, 77,72,74) )
>
> age$N<-rep(27,8)
>
> pop.types <- data.frame(stratum = c("S1","S2"), Freq = c(15, 12))
>
The sampling is unstratified from a population of size 27=15+12
> age.design <- svydesign(ids=~1, data=age, fpc=~N)
> age.post<-postStratify(age.design, strata=~stratum, population=pop.types)
>
> svymean(~y, age.design)
mean SE
y 42.625 7.8163
> svymean(~y, age.post)
mean SE
y 46.148 0.6737
>
Now, post-stratification adjusts the weights to sum to 15 and 12 for the
two post-strata, so the mean estimate is a weighted average of the two
post-stratum-specific means, weighted by 15/27 and 12/27. This is what you
found.
The variance is the variance of the mean of the residuals after
subtracting the post-stratum-specific mean. That is, we can get the
post-stratified estimates by
Adjust the weights
> age.design2 <- svydesign(ids=~1, data=age, fpc=~N,weight=~weight)
Create a 'residuals' variable
> age.design2<-update(age.design2, yres=y-ifelse(stratum=="S1",
23.6,74.33333333))
Post-stratified mean is 46.148
> svymean(~y, age.design2)
mean SE
y 46.148 8.2317
Standard error of post-stratified mean is 0.6737
> svymean(~yres, age.design2)
mean SE
yres 1.4815e-09 0.6737
The formula for the variance of the estimated total is
> (n/(n-1))*((N-sum(nh))/N)*sum(yr2*(Nh/nh)^2)
[1] 330.915
which is formula (11) of the Valliant reference. The variance of the
mean is the same thing divided by N^2:
> (n/(n-1))*((N-sum(nh))/N)*sum(yr2*(Nh/nh)^2)/(N*N)
[1] 0.45393
where yr2<-by(yres^2,age$stratum,sum). These agree exactly with R's
estimates
> vcov(svytotal(~y, design=age.post))
y
y 330.915
> vcov(svymean(~y, design=age.post))
y
y 0.45393
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
More information about the R-help
mailing list