[R] Overdispersion with binomial distribution

Tue Feb 17 18:53:42 CET 2009

Jessica L Hite/hitejl/O/VCU <hitejl <at> vcu.edu> writes:

> I am attempting to run a glm with a binomial model to analyze proportion
> data.
> I have been following Crawley's book closely and am wondering if there is
> an accepted standard for how much is too much overdispersion? (e.g. change
> in AIC has an accepted standard of 2).

  In principle, in the null case (i.e. data are really binomial)
the deviance is  chi-squared distributed with the df equal
to the residual df.

  For example:

example(glm)
deviance(glm.D93) ## 5.13
summary(glm.D93)$dispersion ## 1 (by definition)
dfr <- df.residual(glm.D93)
deviance(glm.D93)/dfr ## 1.28
d2 <- sum(residuals(glm.D93,"pearson")^2) ## 5.17
(disp2 <- d2/dfr)  ## 1.293

gg2 <- update(glm.D93,family=quasipoisson)
summary(gg2)$dispersion  ## 1.293, same as above

pchisq(d2,df=dfr,lower.tail=FALSE)

all.equal(coef(glm.D93),coef(gg2)) ## TRUE

se1 <- coef(summary(glm.D93))[,"Std. Error"]
se2 <- coef(summary(gg2))[,"Std. Error"]
se2/se1

# (Intercept)    outcome2    outcome3  treatment2  treatment3 
#   1.137234    1.137234    1.137234    1.137234    1.137234 

sqrt(disp2)
# [1] 1.137234

> My code and output are below, given the example in the book, these data are
> WAY overdispersed .....do I mention this and go on or does this signal the
> need to try a different model? If so, any suggestions on the type of
> distribution (gamma or negative binomial ?)?

  Way overdispersed may indicate model lack of fit.  Have
you examined residuals/data for outliers etc.?  

  quasibinomial should be fine, or you can try beta-binomial
(see the aod package) ...

> attach(Clutch2)
>  y<-cbind(Total,Size-Total)
> glm1<-glm(y~Pred,"binomial")
> summary(glm1)
> 
> Call:
> glm(formula = y ~ Pred, family = "binomial")
> 
> Deviance Residuals:
>     Min       1Q   Median       3Q      Max
> -9.1022  -2.7899  -0.4781   2.6058   8.4852
> 
> Coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept)  1.35095    0.06612  20.433  < 2e-16 ***
> PredF       -0.34811    0.11719  -2.970  0.00297 **
> PredSN      -3.29156    0.10691 -30.788  < 2e-16 ***
> PredW       -1.46451    0.12787 -11.453  < 2e-16 ***
> PredWF      -0.56412    0.13178  -4.281 1.86e-05 ***
> ---
> #### the output for residual deviance and df does not change even when I
> use quasibinomial, is this ok?  #####

  That's as expected.

>  library(MASS)

  you don't really need MASS for quasibinomial.

> > glm2<-glm(y~Pred,"quasibinomial")
> > summary(glm2)
> 
> Call:
> glm(formula = y ~ Pred, family = "quasibinomial")
> 
> Deviance Residuals:
>     Min       1Q   Median       3Q      Max
> -9.1022  -2.7899  -0.4781   2.6058   8.4852
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)   1.3510     0.2398   5.633 1.52e-07 ***
> PredF        -0.3481     0.4251  -0.819  0.41471
> PredSN       -3.2916     0.3878  -8.488 1.56e-13 ***
> PredW        -1.4645     0.4638  -3.157  0.00208 **
> PredWF       -0.5641     0.4780  -1.180  0.24063
> ---
> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> 
> (Dispersion parameter for quasibinomial family taken to be 13.15786)
> 
>     Null deviance: 2815.5  on 108  degrees of freedom
> Residual deviance: 1323.5  on 104  degrees of freedom
>   (3 observations deleted due to missingness)
> AIC: NA
> 
> Number of Fisher Scoring iterations: 5
> 
> > anova(glm2,test="F")
> Analysis of Deviance Table
> 
> Model: quasibinomial, link: logit
> 
> Response: y
> 
> Terms added sequentially (first to last)
> 
>       Df Deviance Resid. Df Resid. Dev      F   Pr(>F)
> NULL                    108     2815.5
> Pred   4   1492.0       104     1323.5 28.349 6.28e-16 ***
> ---
> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> > model1<-update(glm2,~.-Pred)
> > anova(glm2,model1,test="F")
> Analysis of Deviance Table
> 
> Model 1: y ~ Pred
> Model 2: y ~ 1
>   Resid. Df Resid. Dev  Df Deviance      F   Pr(>F)
> 1       104     1323.5
> 2       108     2815.5  -4  -1492.0 28.349 6.28e-16 ***
> ---
> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> > coef(glm2)
> (Intercept)       PredF      PredSN       PredW      PredWF
>   1.3509550  -0.3481096  -3.2915601  -1.4645097  -0.5641223
> 
> Thanks
> Jessica
> hitejl <at> vcu.edu