[R] svy / weighted regression

David Winsemius dwinsemius at comcast.net
Mon Oct 12 15:32:15 CEST 2009


I think you are missing the point. You have 4 zero death counts  
associated with much higher person years of exposure followed by 4  
death counts in the thousands associated with lower degrees of  
exposures. It seems unlikely that these are real data as there are not  
cohorts that would exhibit such lower death-rates. So it appears that  
in setting up your test case, you have created an impossibly  
unrealistic test problem.

-- 
David


On Oct 12, 2009, at 9:12 AM, Laust wrote:

> Dear Peter,
>
> Thanks for the input. The zero rates in some strata occurs because
> sampling depended on case status: In Finland only 50% of the non-cases
> were sampled, while all others were sampled with 100% probability.
>
> Best
> Laust
>
> On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
> <p.dalgaard at biostat.ku.dk> wrote:
>> Sorry, forgot to "reply all"...
>>
>> Laust wrote:
>>>
>>> Dear list,
>>>
>>> I am trying to set up a propensity-weighted regression using the
>>> survey package. Most of my population is sampled with a sampling
>>> probability of one (that is, I have the full population). However,  
>>> for
>>> a subset of the data I have only a 50% sample of the full  
>>> population.
>>> In previous work on the data, I analyzed these data using SAS and
>>> STATA. In those packages I used a propensity weight of 1/[sampling
>>> probability] in various generalized linear regression-procedures,  
>>> but
>>> I am having trouble setting this up. I bet the solution is simple,  
>>> but
>>> I’m a R newbie. Code to illustrate my problem below.
>>
>> Hi Laust,
>>
>> You probably need the package author to explain fully, but as far  
>> as I
>> can see, the crux is that a dispersion parameter is being used,  
>> based on
>> Pearson residuals, even in the Poisson case (i.e. you effectively get
>> the same result as with quasipoisson()).
>>
>> I don't know what the rationale is for this, but it is clear that  
>> with
>> your data, an estimated dispersion parameter is going to be large.  
>> E.g.
>> the data has both 0 cases in 750000 person-years and 1000 cases in  
>> 5000
>> person-years for Denmark, and in your model they are supposed to have
>> the same Poisson rate.
>>
>> summary.svyglm starts off with
>>
>>    est.disp <- TRUE
>>
>> and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
>> there is probably a perfectly good reason not to just set the  
>> dispersion
>> to 1, but I don't get it either...
>>
>>>
>>> Thanks
>>> Laust
>>>
>>> # loading survey
>>> library(survey)
>>>
>>> # creating data
>>> listc <-
>>> c 
>>> ("Denmark 
>>> ","Finland","Norway","Sweden","Denmark","Finland","Norway","Sweden")
>>> listw <- c(1,2,1,1,1,1,1,1)
>>> listd <- c(0,0,0,0,1000,1000,1000,2000)
>>> listt <- c(750000,500000,900000,1900000,5000,5000,5000,10000)
>>> list.cwdt <- c(listc, listw, listd, listt)
>>> country <-
>>> data 
>>> .frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)
>>>
>>> # running a frequency weighted regression to get the correct point
>>> estimates for comparison
>>> glm <- glm(deaths ~ country + offset(log(yrs_at_risk)),
>>> weights=weight, data=country, family=poisson())
>>> summary(glm)
>>> regTermTest(glm, ~ country)
>>>
>>> # running survey weighted regression
>>> svy <- svydesign(~0,,data=country, weight=~weight)
>>> svyglm <- svyglm(deaths ~ country + offset(log(yrs_at_risk)),
>>> design=svy, data=country, family=poisson())
>>> summary(svyglm)
>>> # point estimates are correct, but standard error is way too large
>>> regTermTest(svyglm, ~ country)
>>> # test indicates no country differences
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> --
>>   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45)  
>> 35327918
>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45)  
>> 35327907
>>
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT




More information about the R-help mailing list