[R-sig-ME] overdispersion with binomial data?

John Maindonald john.maindonald at anu.edu.au
Sun Feb 13 01:04:01 CET 2011


This is a useful clarification.  Jarrod might perhaps have said
"cannot be accounted for in the error term".

What one can do (in the binary glm model) is to fit group as a fixed 
effect, then use the deviance that is due to group to estimate
the overdispersion. Or one can fit a glmm, with group as a
random effect.
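
A minimal sketch of the first suggestion in R, using a hypothetical binary
response y, a predictor x, a grouping factor group and a data frame dat
(these names are placeholders, not from this thread):

fit0 <- glm(y ~ x, family = binomial, data = dat)
fit1 <- glm(y ~ x + group, family = binomial, data = dat)
## mean deviance attributable to group, as a rough dispersion estimate
(deviance(fit0) - deviance(fit1)) / (df.residual(fit0) - df.residual(fit1))

## the glmm alternative: group as a random effect
library(lme4)
fitG <- glmer(y ~ x + (1 | group), family = binomial, data = dat)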

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm

On 13/02/2011, at 9:27 AM, John Maindonald wrote:

> "However, most people seem to ignore overdispersion estimates (chi-square/df) if they are less than about 1.5 or so as a practical matter."
> 
> If there is large uncertainty in the overdispersion estimate, then adjusting for the 
> overdispersion trades bias for that uncertainty.  If the argument is that the
> bias is preferable, p-values should be adjusted for the long-term (over multiple
> studies) bias.  For an overdispersion that averages out at around 1.5, a p-value
> that appears as 0.05 becomes, depending on the degrees of freedom, around 0.1.
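> 
> A back-of-envelope check of that figure (a sketch, using a normal approximation;
> the exact number shifts a little with the degrees of freedom):
> 
> z <- qnorm(0.975)                               ## statistic giving an apparent p = 0.05
> 2 * pnorm(z / sqrt(1.5), lower.tail = FALSE)    ## roughly 0.11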
> 
> Sure, the deviance and Pearson chi-square are commonly quite close.  The 
> preference for the Pearson chi-square, as against the mean deviance, is not, 
> however, arbitrary.  The reduced bias is, over multiple analyses, worth having.
> If the Poisson mean is small, or many of the binomial proportions are close to 0 
> or to 1, the difference is noticeable.
> 
> ---------
>> zp <- rpois(20, 0.25)
>> summary(glm(zp~1, family=quasipoisson))
> . . . .
> 
> (Dispersion parameter for quasipoisson family taken to be 0.7895812)
> 
>  Null deviance: 13.863  on 19  degrees of freedom
> Residual deviance: 13.863  on 19  degrees of freedom
> --------
> 
> Compare the mean deviance = 0.73 = 13.86/19 with the Pearson chi-square
> estimate (as above), which equals 0.79.
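> 
> For anyone wanting to reproduce the two numbers from the fit directly (a small
> sketch; the values match the output above only for the same simulated zp):
> 
> fit <- glm(zp ~ 1, family = quasipoisson)
> deviance(fit) / df.residual(fit)                             ## mean deviance, 0.73 here
> sum(residuals(fit, type = "pearson")^2) / df.residual(fit)   ## Pearson estimate, 0.79 here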
> 
> The preference for the Pearson chi-square, as against the mean deviance,
> is not arbitrary.
> 
> 
> John Maindonald             email: john.maindonald at anu.edu.au
> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
> Centre for Mathematics & Its Applications, Room 1194,
> John Dedman Mathematical Sciences Building (Building 27)
> Australian National University, Canberra ACT 0200.
> http://www.maths.anu.edu.au/~johnm
> 
> On 13/02/2011, at 5:16 AM, Robert A LaBudde wrote:
> 
>> I don't believe in tests generally, so I agree with your point 1) in principle.
>> 
>> However, most people seem to ignore overdispersion estimates (chi-square/df) if they are less than about 1.5 or so as a practical matter. But if you have a reasonable amount of data and get an effect of 10 as in the example given, minor issues as to a posteriori testing are irrelevant. Ignoring an apparent overdispersion of 10 does not seem sensible.
>> 
>> The question as to whether deviance or Pearson chi-square is used is a minor issue, as the two are almost invariably quite close anyway.
>> 
>> In the end, we must all agree that the model must include all important effects to be useful.
>> 
>> At 01:04 AM 2/12/2011, John Maindonald wrote:
>>> 1) Different types of residuals serve different purposes.
>>> 
>>> 2) I am of the school that thinks it misguided to use the
>>> results of a test for overdispersion to decide whether to
>>> model it.  If there is any reason to suspect over-dispersion
>>> (and in many/most ecological applications there is), such
>>> test-then-decide use is anti-conservative.  I judge this a misuse
>>> of statistical testing.  While some do rely on the result of a test
>>> in these circumstances, I have never seen a credible defence of
>>> this practice.
>>> 
>>> 3) In fitting a quasi model using glm(), McCullagh and Nelder
>>> (which I do not have handy at the moment) argue, if I recall
>>> correctly, for use of the Pearson chi-square estimate.  The
>>> mean deviance is unduly susceptible to bias.
>>> 
>>> 4) Whereas the scale factor (the square root of the dispersion estimate) is
>>> incorporated into the GLM residuals, the residuals from glmer() have all
>>> random effects removed, leaving only the Poisson variation: they are what
>>> remains after accounting for all fixed and random effects, including any
>>> observation-level random effect.
>>> 
>>> 5) Your mdf is too small (and hence your rdf divisor too large).  Your
>>> stream, stream:rip and ID random terms account for further 'degrees of
>>> freedom'.  Maybe degrees of freedom are not well defined in this context?
>>> Anyone care to comment?  The size of this quantity cannot, in any case, be
>>> used to indicate over-fitting or under-fitting.  The model assumes
>>> a theoretical value of 1.  Apart from bias in the estimate, the
>>> residuals are constrained by the model to have magnitudes that
>>> are consistent with this theoretical value.  (A rough version of such a
>>> check, with the random terms included in the df count, is sketched after
>>> point 6 below.)
>>> 
>>> 6) If you fit a non-quasi error (binomial or poisson) in a glm model,
>>> the summary output has a column labeled "z value".  If you fit a quasi
>>> error, the corresponding column is labeled "t value".  In the glmer
>>> output, the label 'z value' is in my view almost always inappropriate.
>>> To the extent that the description carries across, it is the counterpart
>>> of the "t value" column in the glm output with the quasi error term.
>>> (Actually, in the case where the denominator is entirely composed
>>> from the theoretical variance, Z values that are all but identical
>>> can almost always [always?] be derived using an appropriate glm
>>> model with a non-quasi error term.)
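>>> 
>>> Pulling 3) and 5) together, a rough Pearson-based version of the check for
>>> modelBQ might look something like the following (a sketch only: it charges one
>>> df per fixed effect and one per variance component, which is exactly the
>>> debatable accounting in 5), and it assumes residuals(, type = "pearson") and
>>> VarCorr() are available for the fitted object):
>>> 
>>> rp  <- residuals(modelBQ, type = "pearson")               ## Pearson residuals
>>> mdf <- length(fixef(modelBQ)) + length(VarCorr(modelBQ))  ## fixed effects + variance components
>>> rdf <- nrow(ept) - mdf                                    ## residual df (approximate)
>>> sum(rp^2) / rdf                                           ## Pearson chi-square / df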
>>> 
>>> John Maindonald             email: john.maindonald at anu.edu.au
>>> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
>>> Centre for Mathematics & Its Applications, Room 1194,
>>> John Dedman Mathematical Sciences Building (Building 27)
>>> Australian National University, Canberra ACT 0200.
>>> http://www.maths.anu.edu.au/~johnm
>>> 
>>> On 12/02/2011, at 12:58 PM, Colin Wahl wrote:
>>> 
>>>> In anticipation of the weekend:
>>>> In my various readings (Crawley, Zuur, Bolker's ecological models book, the
>>>> GLMM_TREE article and its reworked supplementary material, and R help posts),
>>>> the discussion of overdispersion for GLMMs is quite convoluted, with different
>>>> interpretations, different ways to test for it, and different solutions for
>>>> dealing with it. In many cases the differences seem to stem from the type of
>>>> data being analyzed (e.g. binomial vs. Poisson) and from somewhat subjective
>>>> choices about which type of residuals to use for which models.
>>>> 
>>>> The most consistent definition I have found is that overdispersion is indicated
>>>> by a ratio of residual scaled deviance to residual degrees of freedom greater than 1.
>>>> 
>>>> Which seems simple enough.
>>>>> modelB <- glmer(E ~ wsh*rip + (1|stream) + (1|stream:rip), data=ept,
>>>> family=binomial(link="logit"))
>>>>> rdev <- sum(residuals(modelB)^2)   # sum of squared residuals
>>>>> mdf <- length(fixef(modelB))       # number of fixed-effect parameters
>>>>> rdf <- nrow(ept) - mdf             # residual degrees of freedom
>>>>> rdev/rdf  # 9.7 >> 1
>>>> 
>>>> So I conclude my model is overdispersed. The recent consensus solution seems
>>>> to be to create an individual-level random effect and add it to the model.
>>>> 
>>>> ept$obs <- 1:nrow(ept) #create individual level random variable 1:72
>>>> modelBQ<-glmer(E ~ wsh*rip + (1|stream) + (1|stream:rip) + (1|obs),
>>>> data=ept, family=binomial(link="logit"))
>>>> 
>>>> I take a look at the residuals, which are now much smaller but are... just...
>>>> too... good... for my ecological (GLMM-free) experience to be comfortable
>>>> with. Additionally, they fit better for intermediate data, which, with
>>>> binomial errors, is the opposite of what I would expect. Feel free to inspect
>>>> them in the attached image (if attachments work via the mailing list... if not,
>>>> I can send it directly to whomever is interested).
>>>> 
>>>> Because it looks too good... I test overdispersion again for the new model:
>>>> 
>>>>> rdev <- sum(residuals(modelBQ)^2)  # recompute with the new model's residuals
>>>>> rdev/rdf  # 0.37
>>>> 
>>>> Which is terrifically underdispersed; the consensus for underdispersion seems
>>>> to be to ignore it (Zuur et al. 2009).
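>>>> 
>>>> (One further check worth noting at this point, as a sketch rather than part of
>>>> the original analysis: a direct likelihood-ratio comparison of the two fits.
>>>> The usual caveat applies that testing a variance component on its boundary
>>>> makes the reported p-value conservative.)
>>>> 
>>>>> anova(modelB, modelBQ)  # LRT for the observation-level variance component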
>>>> 
>>>> So, for my questions:
>>>> 1. Is there anything relevant to add to/adjust in my approach thus far?
>>>> 2. Is overdispersion an issue I should be concerned with for binomial
>>>> errors? Most sources think so, but I did find a post from Jarrod Hadfield
>>>> back in August where he states that overdispersion does not exist with a
>>>> binary response variable:
>>>> http://web.archiveorange.com/archive/v/rOz2zS8BHYFloUr9F0Ut (though in
>>>> subsequent posts he recommends the approach I have taken, using an
>>>> individual-level random effect).
>>>> 3. Another approach (from Bolker's TREE_GLMM article) is to use Wald t or F
>>>> tests instead of Z or chi-square tests to get p-values, because they "account
>>>> for the uncertainty in the estimates of overdispersion." That seems like a
>>>> nice, simple option, but I have not seen it come up in any other readings. Thoughts?
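>>>> 
>>>> To make question 3 concrete, here is a sketch of what a quasi-likelihood style
>>>> Wald-t adjustment might look like (not taken from the article; phi is a
>>>> dispersion ratio such as the 9.7 above for modelB, rdf the residual df, and
>>>> vcov() is assumed to return the fixed-effect covariance matrix, as in lme4):
>>>> 
>>>>> est  <- fixef(modelB)
>>>>> se   <- sqrt(diag(vcov(modelB))) * sqrt(phi)       # SEs inflated by sqrt(dispersion)
>>>>> tval <- est / se
>>>>> 2 * pt(abs(tval), df = rdf, lower.tail = FALSE)    # Wald t p-values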
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Here are the glmer model outputs:
>>>> 
>>>> ModelB
>>>> Generalized linear mixed model fit by the Laplace approximation
>>>> Formula: E ~ wsh * rip + (1 | stream) + (1 | stream:rip)
>>>> Data: ept
>>>> AIC BIC logLik deviance
>>>> 754.3 777 -367.2    734.3
>>>> Random effects:
>>>> Groups     Name        Variance Std.Dev.
>>>> stream:rip (Intercept) 0.48908  0.69934
>>>> stream     (Intercept) 0.18187  0.42647
>>>> Number of obs: 72, groups: stream:rip, 24; stream, 12
>>>> 
>>>> Fixed effects:
>>>>        Estimate Std. Error z value Pr(>|z|)
>>>> (Intercept) -4.28529    0.50575  -8.473  < 2e-16 ***
>>>> wshd        -2.06605    0.77357  -2.671  0.00757 **
>>>> wshf         3.36248    0.65118   5.164 2.42e-07 ***
>>>> wshg         3.30175    0.76962   4.290 1.79e-05 ***
>>>> ripN         0.07063    0.61930   0.114  0.90920
>>>> wshd:ripN    0.60510    0.94778   0.638  0.52319
>>>> wshf:ripN   -0.80043    0.79416  -1.008  0.31350
>>>> wshg:ripN   -2.78964    0.94336  -2.957  0.00311 **
>>>> 
>>>> ModelBQ
>>>> 
>>>> Generalized linear mixed model fit by the Laplace approximation
>>>> Formula: E ~ wsh * rip + (1 | stream) + (1 | stream:rip) + (1 | obs)
>>>> Data: ept
>>>> AIC   BIC logLik deviance
>>>> 284.4 309.5 -131.2    262.4
>>>> Random effects:
>>>> Groups     Name        Variance Std.Dev.
>>>> obs        (Intercept) 0.30186  0.54942
>>>> stream:rip (Intercept) 0.40229  0.63427
>>>> stream     (Intercept) 0.12788  0.35760
>>>> Number of obs: 72, groups: obs, 72; stream:rip, 24; stream, 12
>>>> 
>>>> Fixed effects:
>>>>        Estimate Std. Error z value Pr(>|z|)
>>>> (Intercept)  -4.2906     0.4935  -8.694  < 2e-16 ***
>>>> wshd         -2.0557     0.7601  -2.705  0.00684 **
>>>> wshf          3.3575     0.6339   5.297 1.18e-07 ***
>>>> wshg          3.3923     0.7486   4.531 5.86e-06 ***
>>>> ripN          0.1425     0.6323   0.225  0.82165
>>>> wshd:ripN     0.3708     0.9682   0.383  0.70170
>>>> wshf:ripN    -0.8665     0.8087  -1.071  0.28400
>>>> wshg:ripN    -3.1530     0.9601  -3.284  0.00102 **
>>>> 
>>>> 
>>>> Cheers,
>>>> --
>>>> Colin Wahl
>>>> Department of Biology
>>>> Western Washington University
>>>> Bellingham WA, 98225
>>>> ph: 360-391-9881
>>>> <ModelComp2.png>
>>> 
>> 
>> ================================================================
>> Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
>> Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
>> 824 Timberlake Drive                     Tel: 757-467-0954
>> Virginia Beach, VA 23464-3239            Fax: 757-467-2947
>> 
>> "Vere scire est per causas scire"
>> ================================================================
>> 
> 




More information about the R-sig-mixed-models mailing list