[R] svy / weighted regression

Tue Oct 13 12:07:25 CEST 2009

Dear David,

Thanks again for your input! I realize that I did a bad job of
explaining this in my first email, but the setup is that in Finland
persons who die are sampled with a different probability (1) from
those who live (.5). This was done by the Finnish data protection
authorities to protect individuals against identification. In the rest
of the countries everyone is sampled with a probability of 1. The data
that I am supplying to R is summarized data for each country
stratified by case status. Another way of organizing the data would
be:

# creating data
listc <- c("Denmark","Finland","Norway","Sweden")
listw <- c(1,2,1,1)
listd <- c(1000,1000,1000,2000)
listt <- c(755000,505000,905000,1910000)
list.cwdt <- c(listc, listw, listd, listt)
country2 <- data.frame(country=listc,weight=listw,deaths=listd,time=listt)

I hope that it is clearer now that for no value of the independent
variable 'country' is the rate going to be zero. I think this was also
not the case in my original example, but this was obscured by my poor
communication- & R-skills. But if data is organized this way then
sampling weight of 2 for Finland should only be applied to the
time-variable that contains person years at risk and *not* to the
number of deaths, which would complicate matters further. I would know
how to get this to work in R or in any other statistical package.
Perhaps it is - as Peter Dalgaard suggested - the estimation of the
dispersion parameter by the survey package that is causing trouble,
not the data example eo ipso. Or perhaps I am just using survey in a
wrong way.

Best
Laust

****
Post doc. Laust Mortensen, PhD
Epidemiology Unit
University of Southern Denmark

On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius <dwinsemius at comcast.net> wrote:
> I think you are missing the point. You have 4 zero death counts associated
> with much higher person years of exposure followed by 4 death counts in the
> thousands associated with lower degrees of exposures. It seems unlikely that
> these are real data as there are not cohorts that would exhibit such lower
> death-rates. So it appears that in setting up your test case, you have
> created an impossibly unrealistic test problem.
>
> --
> David
>
>
> On Oct 12, 2009, at 9:12 AM, Laust wrote:
>
>> Dear Peter,
>>
>> Thanks for the input. The zero rates in some strata occurs because
>> sampling depended on case status: In Finland only 50% of the non-cases
>> were sampled, while all others were sampled with 100% probability.
>>
>> Best
>> Laust
>>
>> On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
>> <p.dalgaard at biostat.ku.dk> wrote:
>>>
>>> Sorry, forgot to "reply all"...
>>>
>>> Laust wrote:
>>>>
>>>> Dear list,
>>>>
>>>> I am trying to set up a propensity-weighted regression using the
>>>> survey package. Most of my population is sampled with a sampling
>>>> probability of one (that is, I have the full population). However, for
>>>> a subset of the data I have only a 50% sample of the full population.
>>>> In previous work on the data, I analyzed these data using SAS and
>>>> STATA. In those packages I used a propensity weight of 1/[sampling
>>>> probability] in various generalized linear regression-procedures, but
>>>> I am having trouble setting this up. I bet the solution is simple, but
>>>> I’m a R newbie. Code to illustrate my problem below.
>>>
>>> Hi Laust,
>>>
>>> You probably need the package author to explain fully, but as far as I
>>> can see, the crux is that a dispersion parameter is being used, based on
>>> Pearson residuals, even in the Poisson case (i.e. you effectively get
>>> the same result as with quasipoisson()).
>>>
>>> I don't know what the rationale is for this, but it is clear that with
>>> your data, an estimated dispersion parameter is going to be large. E.g.
>>> the data has both 0 cases in 750000 person-years and 1000 cases in 5000
>>> person-years for Denmark, and in your model they are supposed to have
>>> the same Poisson rate.
>>>
>>> summary.svyglm starts off with
>>>
>>>   est.disp <- TRUE
>>>
>>> and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
>>> there is probably a perfectly good reason not to just set the dispersion
>>> to 1, but I don't get it either...
>>>
>>>>
>>>> Thanks
>>>> Laust
>>>>
>>>> # loading survey
>>>> library(survey)
>>>>
>>>> # creating data
>>>> listc <-
>>>>
>>>> c("Denmark","Finland","Norway","Sweden","Denmark","Finland","Norway","Sweden")
>>>> listw <- c(1,2,1,1,1,1,1,1)
>>>> listd <- c(0,0,0,0,1000,1000,1000,2000)
>>>> listt <- c(750000,500000,900000,1900000,5000,5000,5000,10000)
>>>> list.cwdt <- c(listc, listw, listd, listt)
>>>> country <-
>>>> data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)
>>>>
>>>> # running a frequency weighted regression to get the correct point
>>>> estimates for comparison
>>>> glm <- glm(deaths ~ country + offset(log(yrs_at_risk)),
>>>> weights=weight, data=country, family=poisson())
>>>> summary(glm)
>>>> regTermTest(glm, ~ country)
>>>>
>>>> # running survey weighted regression
>>>> svy <- svydesign(~0,,data=country, weight=~weight)
>>>> svyglm <- svyglm(deaths ~ country + offset(log(yrs_at_risk)),
>>>> design=svy, data=country, family=poisson())
>>>> summary(svyglm)
>>>> # point estimates are correct, but standard error is way too large
>>>> regTermTest(svyglm, ~ country)
>>>> # test indicates no country differences
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>> --
>>>  O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>>>
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
>