[R] Exactly Replicating Stata's Survey Data Confidence Intervals in R
Thomas Lumley
tlumley at uw.edu
Tue Sep 25 03:26:31 CEST 2012
In the first and third examples it looks as though confint(svymean())
matches the totals and svyciprop(method="logit") matches the
proportions, which is what Stata says
(http://www.stata.com/statalist/archive/2006-10/msg01127.html). The
agreement isn't perfect for the counts in the first example, but since
Stata's default numeric type is single-precision, it's accurate
enough.
In the second example the issue is that svyciprop(method="logit")
doesn't take the design degrees of freedom into account, and Stata
does.
m<-svyglm(I(val==1)~1,family=quasibinomial,design=y)
expit<-function(eta) exp(eta)/(1+exp(eta))
expit(coef(m)+qt(0.975,99)*SE(m))
(Intercept)
0.2971778536
expit(coef(m)-qt(0.975,99)*SE(m))
(Intercept)
0.1287770109
I'll add the df= argument to svyciprop(method="logit"), and then it
should all match.
-thomas
On Tue, Sep 25, 2012 at 5:56 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi Dr. Lumley, you're obviously correct about all of that. Thank you for
> cluing me into it! And sorry for overlooking that part of the
> documentation.
>
> I'm unfortunately still struggling with matching numbers exactly, and I
> foolishly provided a dataset without a weight variable - thinking there was
> only one issue preventing R from matching Stata precisely. Adding the df=
> to the confint() call seems to have only solved half of the problem.
>
> If you (or anyone else) still has any energy to look at this issue, I've
> pasted three analysis examples with both Stata and R code that aim to
> calculate a survey-adjusted confidence interval.
>
> All three examples match coefficient and standard error values precisely,
> for both weighted counts and percents. In the first example, the confidence
> intervals don't match for either the counts or the percents. In the second,
> the CIs match for counts but not percents. In the third, CIs match for
> both. Because of this erratic behavior on relatively straightforward
> datasets, I'm worried that Stata is making some weird (non-reproducible)
> calculations that would obviously be outside of the scope of this list.
>
> Thanks!!
>
>
> # # # # # # # #
> # example 1 #
> # # # # # # # #
>
> # simple example of confidence intervals not matching between stata and R
> # this example doesn't match CIs for counts or percents
>
> options( digits = 10 )
> library(foreign)
> library(survey)
> x <- read.dta(
> "http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop.dta" )
> x$pw <- 6194/310
> y <- svydesign( ~1 , data = x , weights = ~pw )
>
> svytotal( ~factor( stype ) , y )
> confint( svytotal( ~factor( stype ) , y ) , df = degf( y ) )
>
> svymean( ~factor( stype ) , y )
> confint( svymean( ~factor( stype ) , y ) , df = degf( y ) )
>
>
>
>
>
> use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop,
> clear
> gen pw = 6194/310
> svyset [pweight=pw]
> svy: tabulate stype, count se ci format (%12.3f)
> svy: tabulate stype, percent se ci format (%12.3f)
>
>
> ----------------------------------------------------------
> stype | count se lb ub
> ----------+-----------------------------------------------
> E | 88334.428 710.843 86940.929 89727.927
> H | 15085.386 514.508 14076.773 16094.000
> M | 20340.296 582.814 19197.778 21482.814
>
>
> --------------------------------------------------------------
> stype | percentages se lb ub
> ----------+---------------------------------------------------
> E | 71.376 0.574 70.236 72.488
> H | 12.189 0.416 11.397 13.028
> M | 16.435 0.471 15.533 17.379
>
>
>
>
> # # # # # # # #
> # example 2 #
> # # # # # # # #
>
>
>
> # simple example of confidence intervals not matching between stata and R
> # this example does match CIs for counts but does not match for percents
>
> options(digits=10)
> library(foreign)
> library(survey)
>
> # create a sample data frame with 100 records, containing values 1 - 5 and
> weights 1 or 2
> x <- data.frame( case = 1:100 , val = 1:5 , weight = 1:2 )
> # immediately save it externally to be read into stata for comparison
> write.dta( x , "c:/my directory/hundred record example.dta" )
>
> # create a simple survey design, with weights only
> y <- svydesign( ~1 , data = x , weights = ~weight )
>
> # the confidence interval for the weighted counts do match stata
> svytotal( ~I( val == 1 ) , y )
> confint( svytotal( ~I( val == 1 ) , y ) , df = degf( y ) )
>
> # generate the mean & standard error -- these two match precisely..
> svymean( ~I( val == 1 ) , y )
>
> # # # # # # # # # # #
>
> # ..but none of the confidence interval attempts match
> confint( svymean( ~I( val == 1 ) , y ) )
> confint( svymean( ~I( val == 1 ) , y ) , df = degf( y ) )
> svyciprop(~I( val == 1 ) , y, method="li" , df = degf( y ) )
> svyciprop(~I( val == 1 ) , y, method="lo", df = degf( y ) )
> svyciprop(~I( val == 1 ) , y, method="as", df = degf( y ) )
> svyciprop(~I( val == 1 ) , y, method="be", df = degf( y ) )
>
> # # # # # # # # # # #
>
>
> **************************************
>
> * run the R code first to create this dta file
> use c:\my directory\hundred record example.dta , clear
> svyset [pweight=weight]
> svy: tabulate val, count se ci
> svy: tabulate val, percent se ci
>
> Number of strata = 1 Number of obs =
> 100
> Number of PSUs = 100 Population size =
> 150
> Design df =
> 99
>
> ----------------------------------------------------------
> val | count se lb ub
> ----------+-----------------------------------------------
> 1 | 30 6.435 17.23 42.77
> 2 | 30 6.435 17.23 42.77
> 3 | 30 6.435 17.23 42.77
> 4 | 30 6.435 17.23 42.77
> 5 | 30 6.435 17.23 42.77
>
>
> Number of strata = 1 Number of obs =
> 100
> Number of PSUs = 100 Population size =
> 150
> Design df =
> 99
>
> --------------------------------------------------------------
> val | percentages se lb ub
> ----------+---------------------------------------------------
> 1 | 20 4.238 12.88 29.72
> 2 | 20 4.238 12.88 29.72
> 3 | 20 4.238 12.88 29.72
> 4 | 20 4.238 12.88 29.72
> 5 | 20 4.238 12.88 29.72
>
>
>
>
>
> # # # # # # # #
> # example 3 #
> # # # # # # # #
>
>
> # this example matches CIs for both
>
> library(foreign)
> library(survey)
>
> x <- read.dta( "http://www.stata-press.com/data/r11/nhanes2f.dta" )
> y <- svydesign( ~1 , data = x , weights = ~finalwgt )
> svymean( ~sex , y )
> confint( svymean( ~sex , y ) , df = degf( y ) )
>
> mean SE
> sexMale 0.47958 0.0059
> sexFemale 0.52042 0.0059
> 2.5 % 97.5 %
> sexMale 0.4681058 0.4910512
> sexFemale 0.5089488 0.5318942
>
> **************************************
>
> use http://www.stata-press.com/data/r11/nhanes2f, clear
> svyset [pweight=finalwgt]
> svy: tabulate sex, percent se ci
> (running tabulate on estimation sample)
>
> Number of strata = 1 Number of obs =
> 10337
> Number of PSUs = 10337 Population size =
> 117023659
> Design df =
> 10336
>
> --------------------------------------------------------------
> 1=male, |
> 2=female | percentages se lb ub
> ----------+---------------------------------------------------
> Male | 47.96 .5853 46.81 49.11
> Female | 52.04 .5853 50.89 53.19
>
>
>
>
>
>
> On Sun, Sep 23, 2012 at 4:30 PM, Thomas Lumley <tlumley at uw.edu> wrote:
>>
>> On Sat, Sep 22, 2012 at 2:51 AM, Anthony Damico <ajdamico at gmail.com>
>> wrote:
>>
>> > Survey: Mean estimation
>> >
>> > Number of strata = 1 Number of obs = 183
>> > Number of PSUs = 15 Population size = 9235.4
>> > Design df = 14
>> >
>> > --------------------------------------------------------------
>> > | Linearized
>> > | Mean Std. Err. [95% Conf. Interval]
>> > -------------+------------------------------------------------
>> > ell0 | .0218579 .0226225 -.0266624 .0703783
>> > --------------------------------------------------------------
>>
>> This matches
>>
>> > svyciprop(~I(ell==0),dclus1,df=14,method="mean")
>> 2.5% 97.5%
>> I(ell == 0) 0.0218579 -0.0266624 0.07038
>>
>> as does this
>>
>> > confint(svymean(~I(ell==0),dclus1),df=14)
>> 2.5 % 97.5 %
>> I(ell == 0)FALSE 0.92962174505 1.02666240796
>> I(ell == 0)TRUE -0.02666240796 0.07037825495
>>
>> The df= argument is not explicitly documented in ?svyciprop, but it is
>> in ?confint.svystat and ?svymean
>>
>> [I was slowed down a bit by the claim that the Stata intervals were
>> asymmetric, but in fact they aren't]
>>
>> -thomas
>>
>> --
>> Thomas Lumley
>> Professor of Biostatistics
>> University of Auckland
>
>
--
Thomas Lumley
Professor of Biostatistics
University of Auckland
More information about the R-help
mailing list