[R] svy / weighted regression

Wed Oct 14 09:42:34 CEST 2009

Dear Thomas & David

That makes sense! If I wanted to use survey on the summarized data, I
suppose that I could 'de-summarize' or 're-individualize' the data to
give the design object the correct information on the number of
observations. Or I could revert to using the actual individual-level
data.

Thanks a lot, your input has been very helpful.
Laust

****

Post doc. Laust Mortensen, PhD
Epidemiology Unit
University of Southern Denmark

2009/10/13 Thomas Lumley <tlumley at u.washington.edu>:
>
> I think there is a much simpler explanation.
>
> The survey design object has eight observations, two per country.   With a sample size of two per country it is hardly surprising that country-specific estimates are not very precise. The actual data has hundreds of thousands  of observations per country, so it will have more precise estimates.
>
> Grouping the data doesn't make a difference for model-based glm estimation, where it is simply a computational convenience. It *does* make a difference for design-based estimation, because it changes the design.
>
>         -thomas
>
>
> On Tue, 13 Oct 2009, Laust wrote:
>
>> Dear David,
>>
>> Thanks again for your input! I realize that I did a bad job of
>> explaining this in my first email, but the setup is that in Finland
>> persons who die are sampled with a different probability (1) from
>> those who live (.5). This was done by the Finnish data protection
>> authorities to protect individuals against identification. In the rest
>> of the countries everyone is sampled with a probability of 1. The data
>> that I am supplying to R is summarized data for each country
>> stratified by case status. Another way of organizing the data would
>> be:
>>
>> # creating data
>> listc <- c("Denmark","Finland","Norway","Sweden")
>> listw <- c(1,2,1,1)
>> listd <- c(1000,1000,1000,2000)
>> listt <- c(755000,505000,905000,1910000)
>> list.cwdt <- c(listc, listw, listd, listt)
>> country2 <- data.frame(country=listc,weight=listw,deaths=listd,time=listt)
>>
>> I hope that it is clearer now that for no value of the independent
>> variable 'country' is the rate going to be zero. I think this was also
>> not the case in my original example, but this was obscured by my poor
>> communication- & R-skills. But if data is organized this way then
>> sampling weight of 2 for Finland should only be applied to the
>> time-variable that contains person years at risk and *not* to the
>> number of deaths, which would complicate matters further. I would know
>> how to get this to work in R or in any other statistical package.
>> Perhaps it is - as Peter Dalgaard suggested - the estimation of the
>> dispersion parameter by the survey package that is causing trouble,
>> not the data example eo ipso. Or perhaps I am just using survey in a
>> wrong way.
>>
>> Best
>> Laust
>>
>> ****
>> Post doc. Laust Mortensen, PhD
>> Epidemiology Unit
>> University of Southern Denmark
>>
>> On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>> I think you are missing the point. You have 4 zero death counts associated
>>> with much higher person years of exposure followed by 4 death counts in the
>>> thousands associated with lower degrees of exposures. It seems unlikely that
>>> these are real data as there are not cohorts that would exhibit such lower
>>> death-rates. So it appears that in setting up your test case, you have
>>> created an impossibly unrealistic test problem.
>>>
>>> --
>>> David
>>>
>>>
>>> On Oct 12, 2009, at 9:12 AM, Laust wrote:
>>>
>>>> Dear Peter,
>>>>
>>>> Thanks for the input. The zero rates in some strata occurs because
>>>> sampling depended on case status: In Finland only 50% of the non-cases
>>>> were sampled, while all others were sampled with 100% probability.
>>>>
>>>> Best
>>>> Laust
>>>>
>>>> On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
>>>> <p.dalgaard at biostat.ku.dk> wrote:
>>>>>
>>>>> Sorry, forgot to "reply all"...
>>>>>
>>>>> Laust wrote:
>>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> I am trying to set up a propensity-weighted regression using the
>>>>>> survey package. Most of my population is sampled with a sampling
>>>>>> probability of one (that is, I have the full population). However, for
>>>>>> a subset of the data I have only a 50% sample of the full population.
>>>>>> In previous work on the data, I analyzed these data using SAS and
>>>>>> STATA. In those packages I used a propensity weight of 1/[sampling
>>>>>> probability] in various generalized linear regression-procedures, but
>>>>>> I am having trouble setting this up. I bet the solution is simple, but
>>>>>> I’m a R newbie. Code to illustrate my problem below.
>>>>>
>>>>> Hi Laust,
>>>>>
>>>>> You probably need the package author to explain fully, but as far as I
>>>>> can see, the crux is that a dispersion parameter is being used, based on
>>>>> Pearson residuals, even in the Poisson case (i.e. you effectively get
>>>>> the same result as with quasipoisson()).
>>>>>
>>>>> I don't know what the rationale is for this, but it is clear that with
>>>>> your data, an estimated dispersion parameter is going to be large. E.g.
>>>>> the data has both 0 cases in 750000 person-years and 1000 cases in 5000
>>>>> person-years for Denmark, and in your model they are supposed to have
>>>>> the same Poisson rate.
>>>>>
>>>>> summary.svyglm starts off with
>>>>>
>>>>>   est.disp <- TRUE
>>>>>
>>>>> and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
>>>>> there is probably a perfectly good reason not to just set the dispersion
>>>>> to 1, but I don't get it either...
>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Laust
>>>>>>
>>>>>> # loading survey
>>>>>> library(survey)
>>>>>>
>>>>>> # creating data
>>>>>> listc <-
>>>>>>
>>>>>> c("Denmark","Finland","Norway","Sweden","Denmark","Finland","Norway","Sweden")
>>>>>> listw <- c(1,2,1,1,1,1,1,1)
>>>>>> listd <- c(0,0,0,0,1000,1000,1000,2000)
>>>>>> listt <- c(750000,500000,900000,1900000,5000,5000,5000,10000)
>>>>>> list.cwdt <- c(listc, listw, listd, listt)
>>>>>> country <-
>>>>>> data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)
>>>>>>
>>>>>> # running a frequency weighted regression to get the correct point
>>>>>> estimates for comparison
>>>>>> glm <- glm(deaths ~ country + offset(log(yrs_at_risk)),
>>>>>> weights=weight, data=country, family=poisson())
>>>>>> summary(glm)
>>>>>> regTermTest(glm, ~ country)
>>>>>>
>>>>>> # running survey weighted regression
>>>>>> svy <- svydesign(~0,,data=country, weight=~weight)
>>>>>> svyglm <- svyglm(deaths ~ country + offset(log(yrs_at_risk)),
>>>>>> design=svy, data=country, family=poisson())
>>>>>> summary(svyglm)
>>>>>> # point estimates are correct, but standard error is way too large
>>>>>> regTermTest(svyglm, ~ country)
>>>>>> # test indicates no country differences
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>> --
>>>>>  O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>>>>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>>>>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>>>>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>>>>>
>>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> David Winsemius, MD
>>> Heritage Laboratories
>>> West Hartford, CT
>>>
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> Thomas Lumley                   Assoc. Professor, Biostatistics
> tlumley at u.washington.edu        University of Washington, Seattle
>
>
>