[R] svy / weighted regression

Sat Oct 10 11:02:46 CEST 2009

Sorry, forgot to "reply all"...

Laust wrote:
> Dear list,
> 
> I am trying to set up a propensity-weighted regression using the
> survey package. Most of my population is sampled with a sampling
> probability of one (that is, I have the full population). However, for
> a subset of the data I have only a 50% sample of the full population.
> In previous work on the data, I analyzed these data using SAS and
> STATA. In those packages I used a propensity weight of 1/[sampling
> probability] in various generalized linear regression-procedures, but
> I am having trouble setting this up. I bet the solution is simple, but
> I’m a R newbie. Code to illustrate my problem below.

Hi Laust,

You probably need the package author to explain fully, but as far as I
can see, the crux is that a dispersion parameter is being used, based on
Pearson residuals, even in the Poisson case (i.e. you effectively get
the same result as with quasipoisson()).

I don't know what the rationale is for this, but it is clear that with
your data, an estimated dispersion parameter is going to be large. E.g.
the data has both 0 cases in 750000 person-years and 1000 cases in 5000
person-years for Denmark, and in your model they are supposed to have
the same Poisson rate.

summary.svyglm starts off with

     est.disp <- TRUE

and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
there is probably a perfectly good reason not to just set the dispersion
to 1, but I don't get it either...

> 
> Thanks
> Laust
> 
> # loading survey
> library(survey)
> 
> # creating data
> listc <- c("Denmark","Finland","Norway","Sweden","Denmark","Finland","Norway","Sweden")
> listw <- c(1,2,1,1,1,1,1,1)
> listd <- c(0,0,0,0,1000,1000,1000,2000)
> listt <- c(750000,500000,900000,1900000,5000,5000,5000,10000)
> list.cwdt <- c(listc, listw, listd, listt)
> country <- data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)
> 
> # running a frequency weighted regression to get the correct point
> estimates for comparison
> glm <- glm(deaths ~ country + offset(log(yrs_at_risk)),
> weights=weight, data=country, family=poisson())
> summary(glm)
> regTermTest(glm, ~ country)
> 
> # running survey weighted regression
> svy <- svydesign(~0,,data=country, weight=~weight)
> svyglm <- svyglm(deaths ~ country + offset(log(yrs_at_risk)),
> design=svy, data=country, family=poisson())
> summary(svyglm)
> # point estimates are correct, but standard error is way too large
> regTermTest(svyglm, ~ country)
> # test indicates no country differences
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907