[R] Overdispersion with binomial distribution
Ben Bolker
bolker at ufl.edu
Tue Feb 17 19:54:27 CET 2009
Thanks for the clarification.
I actually had MASS open to that page while
I was composing my reply but forgot to mention
it (trying to do too many things at once) ...
Ben Bolker
Prof Brian Ripley wrote:
> On Tue, 17 Feb 2009, Ben Bolker wrote:
>
>> Jessica L Hite/hitejl/O/VCU <hitejl <at> vcu.edu> writes:
>>
>>> I am attempting to run a glm with a binomial model to analyze proportion
>>> data.
>>> I have been following Crawley's book closely and am wondering if there is
>>> an accepted standard for how much is too much overdispersion? (e.g. change
>>> in AIC has an accepted standard of 2).
>
>> In principle, in the null case (i.e. data are really binomial)
>> the deviance is chi-squared distributed with the df equal
>> to the residual df.
>
> *Approximately*, provided the expected counts are not near or below
> one. See MASS §7.5 for an analysis of the size of the approximation
> errors (which can be large and in both directions).
>
> Given that I once had a consulting job where the over-dispersion was
> causing something close ot panic and was entirely illusory, the lack
> of the 'approximately' can have quite serious consequences.
>
>> For example:
>>
>> example(glm)
>> deviance(glm.D93) ## 5.13
>> summary(glm.D93)$dispersion ## 1 (by definition)
>> dfr <- df.residual(glm.D93)
>> deviance(glm.D93)/dfr ## 1.28
>> d2 <- sum(residuals(glm.D93,"pearson")^2) ## 5.17
>> (disp2 <- d2/dfr) ## 1.293
>>
>> gg2 <- update(glm.D93,family=quasipoisson)
>> summary(gg2)$dispersion ## 1.293, same as above
>>
>> pchisq(d2,df=dfr,lower.tail=FALSE)
>>
>> all.equal(coef(glm.D93),coef(gg2)) ## TRUE
>>
>> se1 <- coef(summary(glm.D93))[,"Std. Error"]
>> se2 <- coef(summary(gg2))[,"Std. Error"]
>> se2/se1
>>
>> # (Intercept) outcome2 outcome3 treatment2 treatment3
>> # 1.137234 1.137234 1.137234 1.137234 1.137234
>>
>> sqrt(disp2)
>> # [1] 1.137234
>>
>>> My code and output are below, given the example in the book, these data are
>>> WAY overdispersed .....do I mention this and go on or does this signal the
>>> need to try a different model? If so, any suggestions on the type of
>>> distribution (gamma or negative binomial ?)?
>> Way overdispersed may indicate model lack of fit. Have
>> you examined residuals/data for outliers etc.?
>>
>> quasibinomial should be fine, or you can try beta-binomial
>> (see the aod package) ...
>>
>>
>>> attach(Clutch2)
>>> y<-cbind(Total,Size-Total)
>>> glm1<-glm(y~Pred,"binomial")
>>> summary(glm1)
>>>
>>> Call:
>>> glm(formula = y ~ Pred, family = "binomial")
>>>
>>> Deviance Residuals:
>>> Min 1Q Median 3Q Max
>>> -9.1022 -2.7899 -0.4781 2.6058 8.4852
>>>
>>> Coefficients:
>>> Estimate Std. Error z value Pr(>|z|)
>>> (Intercept) 1.35095 0.06612 20.433 < 2e-16 ***
>>> PredF -0.34811 0.11719 -2.970 0.00297 **
>>> PredSN -3.29156 0.10691 -30.788 < 2e-16 ***
>>> PredW -1.46451 0.12787 -11.453 < 2e-16 ***
>>> PredWF -0.56412 0.13178 -4.281 1.86e-05 ***
>>> ---
>>> #### the output for residual deviance and df does not change even when I
>>> use quasibinomial, is this ok? #####
>> That's as expected.
>>
>>> library(MASS)
>> you don't really need MASS for quasibinomial.
>>
>>>> glm2<-glm(y~Pred,"quasibinomial")
>>>> summary(glm2)
>>> Call:
>>> glm(formula = y ~ Pred, family = "quasibinomial")
>>>
>>> Deviance Residuals:
>>> Min 1Q Median 3Q Max
>>> -9.1022 -2.7899 -0.4781 2.6058 8.4852
>>>
>>> Coefficients:
>>> Estimate Std. Error t value Pr(>|t|)
>>> (Intercept) 1.3510 0.2398 5.633 1.52e-07 ***
>>> PredF -0.3481 0.4251 -0.819 0.41471
>>> PredSN -3.2916 0.3878 -8.488 1.56e-13 ***
>>> PredW -1.4645 0.4638 -3.157 0.00208 **
>>> PredWF -0.5641 0.4780 -1.180 0.24063
>>> ---
>>> Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>>>
>>> (Dispersion parameter for quasibinomial family taken to be 13.15786)
>>>
>>> Null deviance: 2815.5 on 108 degrees of freedom
>>> Residual deviance: 1323.5 on 104 degrees of freedom
>>> (3 observations deleted due to missingness)
>>> AIC: NA
>>>
>>> Number of Fisher Scoring iterations: 5
>>>
>>>> anova(glm2,test="F")
>>> Analysis of Deviance Table
>>>
>>> Model: quasibinomial, link: logit
>>>
>>> Response: y
>>>
>>> Terms added sequentially (first to last)
>>>
>>> Df Deviance Resid. Df Resid. Dev F Pr(>F)
>>> NULL 108 2815.5
>>> Pred 4 1492.0 104 1323.5 28.349 6.28e-16 ***
>>> ---
>>> Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>>>> model1<-update(glm2,~.-Pred)
>>>> anova(glm2,model1,test="F")
>>> Analysis of Deviance Table
>>>
>>> Model 1: y ~ Pred
>>> Model 2: y ~ 1
>>> Resid. Df Resid. Dev Df Deviance F Pr(>F)
>>> 1 104 1323.5
>>> 2 108 2815.5 -4 -1492.0 28.349 6.28e-16 ***
>>> ---
>>> Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>>>> coef(glm2)
>>> (Intercept) PredF PredSN PredW PredWF
>>> 1.3509550 -0.3481096 -3.2915601 -1.4645097 -0.5641223
>>>
>>> Thanks
>>> Jessica
>>> hitejl <at> vcu.edu
>
>
--
Ben Bolker
Associate professor, Biology Dep't, Univ. of Florida
bolker at ufl.edu / www.zoology.ufl.edu/bolker
GPG key: www.zoology.ufl.edu/bolker/benbolker-publickey.asc
More information about the R-help
mailing list