[R-sig-ME] overdispersion with binomial data?
John Maindonald
john.maindonald at anu.edu.au
Sun Feb 13 01:04:01 CET 2011
This is a useful clarification. Jared might perhaps have said
"cannot be accounted for in the error term".
What one can do (in the binary glm model) is to fit group as a fixed
effect, then use the deviance that is due to group to estimate
the overdispersion. Or one can use a glmm model with group
as a random effect.
John Maindonald email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm
On 13/02/2011, at 9:27 AM, John Maindonald wrote:
> "However, most people seem to ignore overdispersion estimates (chi-square/df) if they are less than about 1.5 or so as a practical matter."
>
> If there is large uncertainty in the overdispersion estimate, then adjusting for the
> overdispersion is to trade bias for that uncertainty. If the argument is that the
> bias is preferable, p-values should be adjusted for the long-term (over multiple
> studies) bias. For an overdispersion that averages out at around 1.5, a p-value
> that appears as 0.05 becomes, depending on degrees of freedom, around 0.1
>
> Sure, the deviance and Pearson chi-square are commonly quite close. The
> preference for the Pearson chi-square, as against the mean deviance, is not
> however arbitrary. The reduced bias is, over multiple analyses, worth having.
> If the Poisson mean is small, or many of the binomial proportions are close to 0
> or to 1, it is noticeable.
>
> ---------
>> zp <- rpois(20, 0.25)
>> summary(glm(zp~1, family=quasipoisson))
> . . . .
>
> (Dispersion parameter for quasipoisson family taken to be 0.7895812)
>
> Null deviance: 13.863 on 19 degrees of freedom
> Residual deviance: 13.863 on 19 degrees of freedom
> --------
>
> Compare the mean chi-square = 0.73 = 13.86/19 with a Pearson chi-square
> estimate (as above) that equals 0.79
>
> The preference for the Pearson chi-square, as against the mean deviance,
> is not arbitrary.
>
>
> John Maindonald email: john.maindonald at anu.edu.au
> phone : +61 2 (6125)3473 fax : +61 2(6125)5549
> Centre for Mathematics & Its Applications, Room 1194,
> John Dedman Mathematical Sciences Building (Building 27)
> Australian National University, Canberra ACT 0200.
> http://www.maths.anu.edu.au/~johnm
>
> On 13/02/2011, at 5:16 AM, Robert A LaBudde wrote:
>
>> I don't believe in tests generally, so I agree with your point 1) in principle.
>>
>> However, most people seem to ignore overdispersion estimates (chi-square/df) if they are less than about 1.5 or so as a practical matter. But if you have a reasonable amount of data and get an effect of 10 as in the example given, minor issues as to a posteriori testing are irrelevant. Ignoring an apparent overdispersion of 10 does not seem sensible.
>>
>> The question as to whether deviance or Pearson chi-square is used is a minor issue, as the two are almost invariably quite close anyway.
>>
>> In the end, we must all agree that the model must include all important effects to be useful.
>>
>> At 01:04 AM 2/12/2011, John Maindonald wrote:
>>> 1) Different types of residuals serve different purposes.
>>>
>>> 2) I am of the school that thinks it misguided to use the
>>> results of a test for overdispersion to decide whether to
>>> model it. If there is any reason to suspect over-dispersion
>>> (and in many/most ecological applications there is), this
>>> is anti-conservative. I judge this a misuse of statistical
>>> testing. While, some do rely on the result of a test in these
>>> circumstances, I have never seen a credible defence of
>>> this practice.
>>>
>>> 3) In fitting a quasi model using glm(), McCullagh and Nelder
>>> (which I do not have handy at the moment) argue, if I recall
>>> correctly, for use of the Pearson chi-square estimate. The
>>> mean deviance is unduly susceptible to bias.
>>>
>>> 4) Whereas the scale factor (sqrt dispersion estimate) is incorporated
>>> into the GLM residuals, the residuals from glmer() exclude all
>>> random effects except that due to poisson variation. The residuals
>>> are what remains after accounting for all fixed and random effects,
>>> including observation level random effects.
>>>
>>> 5) Your mdf divisor is too small. Your stream, stream:rip and ID
>>> random terms account for further 'degrees of freedom'. Maybe
>>> degrees of freedom are not well defined in this context? Anyone
>>> care to comment? The size of this quantity cannot, in any case, be
>>> used to indicate over-fitting or under-fitting. The model assumes
>>> a theoretical value of 1. Apart from bias in the estimate, the
>>> residuals are constrained by the model to have magnitudes that
>>> are consistent with this theoretical value.
>>>
>>> 6) If you fit a non-quasi error (binomial or poisson) in a glm model,
>>> the summary output has a column labeled "z value". If you fit a quasi
>>> error, the corresponding column is labeled "t value". In the glmer
>>> output, the label 'z value' is in my view almost always inappropriate.
>>> To the extent that the description carries across, it is the counterpart
>>> of the "t value" column in the glm output with the quasi error term.
>>> (Actually, in the case where the denominator is entirely composed
>>> from the theoretical variance, Z values that are as near as maybe
>>> identical can almost always [always?] be derived using an
>>> appropriate glm model with a non-quasi error term.)
>>>
>>> John Maindonald email: john.maindonald at anu.edu.au
>>> phone : +61 2 (6125)3473 fax : +61 2(6125)5549
>>> Centre for Mathematics & Its Applications, Room 1194,
>>> John Dedman Mathematical Sciences Building (Building 27)
>>> Australian National University, Canberra ACT 0200.
>>> http://www.maths.anu.edu.au/~johnm
>>>
>>> On 12/02/2011, at 12:58 PM, Colin Wahl wrote:
>>>
>>>> In anticipation of the weekend:
>>>> In my various readings(crawley, zuur, bolker's ecological models book, and
>>>> the GLMM_TREE article, reworked supplementary material and R help posts) the
>>>> discussion of overdispersion for glmm is quite convoluted by different
>>>> interpretations, different ways to test for it, and different solutions to
>>>> deal with it. In many cases differences seem to stem from the type of data
>>>> being analyzed (e.g. binomial vs. poisson) and somewhat subjective options
>>>> for which type of residuals to use for which models.
>>>>
>>>> The most consistent definition I have found is overdispersion is defined by
>>>> a ratio of residual scaled deviance to the residual degrees of freedom > 1.
>>>>
>>>> Which seems simple enough.
>>>>> modelB<-glmer(E ~ wsh*rip + (1|stream) + (1|stream:rip), data=ept,
>>>> family=binomial(link="logit"))
>>>>> rdev <- sum(residuals(modelBQ)^2)
>>>>> mdf <- length(fixef(modelBQ))
>>>>> rdf <- nrow(ept)-mdf
>>>>> rdev/rdf #9.7 >>1
>>>>
>>>> So I conclude my model is overdispersed. The recent consensus solution seems
>>>> to be to create and add a individual level random variable to the model.
>>>>
>>>> ept$obs <- 1:nrow(ept) #create individual level random variable 1:72
>>>> modelBQ<-glmer(E ~ wsh*rip + (1|stream) + (1|stream:rip) + (1|obs),
>>>> data=ept, family=binomial(link="logit"))
>>>>
>>>> I take a look at the residuals which are now much smaller but are... just...
>>>> too... good... for my ecological (glmm free) experience to be comfortable
>>>> with. Additionally, they fit better for intermediate data, which, with
>>>> binomial errors is the opposite of what I would expect. Feel free to inspect
>>>> them in the attached image (if attachments work via mail list... if not, I
>>>> can send it directly to whomever is interested).
>>>>
>>>> Because it looks too good... I test overdispersion again for the new model:
>>>>
>>>> rdev/rdf #0.37
>>>>
>>>> Which is terrifically underdispersed, for which the consensus is to ignore
>>>> it (Zuur et al. 2009).
>>>>
>>>> So, for my questions:
>>>> 1. Is there anything relevant to add to/adjust in my approach thus far?
>>>> 2. Is overdispersion an issue I should be concerned with for binomial
>>>> errors? Most sources think so, but I did find a post from Jerrod Hadfield
>>>> back in august where he states that overdispersion does not exist with a
>>>> binary response variable:
>>>> http://web.archiveorange.com/archive/v/rOz2zS8BHYFloUr9F0Ut (though in
>>>> subsequent posts he recommends the approach I have taken by using an
>>>> individual level random variable).
>>>> 3. Another approach (from Bolker's TREE_GLMM article) is to use Wald t or F
>>>> tests instead of Z or X^2 tests to get p values because they "account for
>>>> the uncertainty in the estimates of overdispersion." That seems like a nice
>>>> simple option, I have not seen this come up in any other readings. Thoughts?
>>>>
>>>>
>>>>
>>>>
>>>> Here are the glmer model outputs:
>>>>
>>>> ModelB
>>>> Generalized linear mixed model fit by the Laplace approximation
>>>> Formula: E ~ wsh * rip + (1 | stream) + (1 | stream:rip)
>>>> Data: ept
>>>> AIC BIC logLik deviance
>>>> 754.3 777 -367.2 734.3
>>>> Random effects:
>>>> Groups Name Variance Std.Dev.
>>>> stream:rip (Intercept) 0.48908 0.69934
>>>> stream (Intercept) 0.18187 0.42647
>>>> Number of obs: 72, groups: stream:rip, 24; stream, 12
>>>>
>>>> Fixed effects:
>>>> Estimate Std. Error z value Pr(>|z|)
>>>> (Intercept) -4.28529 0.50575 -8.473 < 2e-16 ***
>>>> wshd -2.06605 0.77357 -2.671 0.00757 **
>>>> wshf 3.36248 0.65118 5.164 2.42e-07 ***
>>>> wshg 3.30175 0.76962 4.290 1.79e-05 ***
>>>> ripN 0.07063 0.61930 0.114 0.90920
>>>> wshd:ripN 0.60510 0.94778 0.638 0.52319
>>>> wshf:ripN -0.80043 0.79416 -1.008 0.31350
>>>> wshg:ripN -2.78964 0.94336 -2.957 0.00311 **
>>>>
>>>> ModelBQ
>>>>
>>>> Generalized linear mixed model fit by the Laplace approximation
>>>> Formula: E ~ wsh * rip + (1 | stream) + (1 | stream:rip) + (1 | obs)
>>>> Data: ept
>>>> AIC BIC logLik deviance
>>>> 284.4 309.5 -131.2 262.4
>>>> Random effects:
>>>> Groups Name Variance Std.Dev.
>>>> obs (Intercept) 0.30186 0.54942
>>>> stream:rip (Intercept) 0.40229 0.63427
>>>> stream (Intercept) 0.12788 0.35760
>>>> Number of obs: 72, groups: obs, 72; stream:rip, 24; stream, 12
>>>>
>>>> Fixed effects:
>>>> Estimate Std. Error z value Pr(>|z|)
>>>> (Intercept) -4.2906 0.4935 -8.694 < 2e-16 ***
>>>> wshd -2.0557 0.7601 -2.705 0.00684 **
>>>> wshf 3.3575 0.6339 5.297 1.18e-07 ***
>>>> wshg 3.3923 0.7486 4.531 5.86e-06 ***
>>>> ripN 0.1425 0.6323 0.225 0.82165
>>>> wshd:ripN 0.3708 0.9682 0.383 0.70170
>>>> wshf:ripN -0.8665 0.8087 -1.071 0.28400
>>>> wshg:ripN -3.1530 0.9601 -3.284 0.00102 **
>>>>
>>>>
>>>> Cheers,
>>>> --
>>>> Colin Wahl
>>>> Department of Biology
>>>> Western Washington University
>>>> Bellingham WA, 98225
>>>> ph: 360-391-9881
>>>> <ModelComp2.png>_______________________________________________
>>>> R-sig-mixed-models at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>
>>> _______________________________________________
>>> R-sig-mixed-models at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
>> ================================================================
>> Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: ral at lcfltd.com
>> Least Cost Formulations, Ltd. URL: http://lcfltd.com/
>> 824 Timberlake Drive Tel: 757-467-0954
>> Virginia Beach, VA 23464-3239 Fax: 757-467-2947
>>
>> "Vere scire est per causas scire"
>> ================================================================
>>
>
More information about the R-sig-mixed-models
mailing list