[R-sig-ME] overdispersion with binomial data?

Robert A LaBudde ral at lcfltd.com
Sun Feb 13 03:50:00 CET 2011


Just to let you know: I love and respect your body of work, Prof. 
Maindonald, and your last comments are particularly pithy for the cognoscenti.

However, if you follow the whole thread, how we got to this point is 
something of a "chicken or egg" problem.

If we always include "Group" in the model to check for 
overdispersion, and accept it if it tests positive but not if it 
doesn't, this biases our result in a particular way. If we exclude 
"Group" but test based on the Pearson chi-square, this biases our result 
in another way. And in neither case is the overdispersion guaranteed 
to fit the assumed form (e.g., a linear additive effect).

I believe that any kind of testing used to choose among models biases 
the choice of model. I much prefer choosing models based on 
subject-matter expertise instead. I also prefer to base decisions on 
the size of an observed effect rather than on its statistical 
detectability (significance) in a particular study.

In the present case, I'd feel uneasy about any model that didn't 
include a "Group" effect, whether or not it was detectable 
(significant). That's because it must be present logically, even if 
small. So the logical plan is to include and estimate it.


At 07:04 PM 2/12/2011, John Maindonald wrote:
>This is a useful clarification.  Jared might perhaps have said
>"cannot be accounted for in the error term".
>
>What one can do (in the binary glm model) is to fit group as a fixed
>effect, then use the deviance that is due to group to estimate
>the overdispersion. Or one can use a glmm model with group
>as a random effect.
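>
>In outline, a minimal sketch of both routes (the data frame dat and the
>columns y, n, x and group below are placeholder names only):
>
>## fixed-effect route: dispersion from the deviance that is due to group
>fit0 <- glm(cbind(y, n - y) ~ x, family = binomial, data = dat)
>fit1 <- glm(cbind(y, n - y) ~ x + group, family = binomial, data = dat)
>(deviance(fit0) - deviance(fit1))/(df.residual(fit0) - df.residual(fit1))
>
>## or: absorb the group-to-group variation in a random effect
>library(lme4)
>fit2 <- glmer(cbind(y, n - y) ~ x + (1 | group),
>              data = dat, family = binomial)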
>
>John Maindonald             email: john.maindonald at anu.edu.au
>phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
>Centre for Mathematics & Its Applications, Room 1194,
>John Dedman Mathematical Sciences Building (Building 27)
>Australian National University, Canberra ACT 0200.
>http://www.maths.anu.edu.au/~johnm
>
>On 13/02/2011, at 9:27 AM, John Maindonald wrote:
>
> > "However, most people seem to ignore overdispersion estimates 
> (chi-square/df) if they are less than about 1.5 or so as a practical matter."
> >
> > If there is large uncertainty in the overdispersion estimate, then adjusting
> > for the overdispersion is to trade bias for that uncertainty.  If the argument
> > is that the bias is preferable, p-values should be adjusted for the long-term
> > (over multiple studies) bias.  For an overdispersion that averages out at
> > around 1.5, a p-value that appears as 0.05 becomes, depending on degrees of
> > freedom, around 0.1.
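> >
> > A back-of-envelope version of that calculation (normal approximation; a
> > t reference with few degrees of freedom shifts the figure a little):
> >
> >> z <- qnorm(0.975)                        # |z| giving two-sided p = 0.05
> >> 2*pnorm(z/sqrt(1.5), lower.tail=FALSE)   # rescaled for dispersion 1.5, ~0.11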
> >
> > Sure, the deviance and Pearson chi-square are commonly quite close.  The
> > preference for the Pearson chi-square, as against the mean deviance, is not
> > however arbitrary.  The reduced bias is, over multiple analyses, worth having.
> > If the Poisson mean is small, or many of the binomial proportions are close
> > to 0 or to 1, it is noticeable.
> >
> > ---------
> >> zp <- rpois(20, 0.25)
> >> summary(glm(zp~1, family=quasipoisson))
> > . . . .
> >
> > (Dispersion parameter for quasipoisson family taken to be 0.7895812)
> >
> >  Null deviance: 13.863  on 19  degrees of freedom
> > Residual deviance: 13.863  on 19  degrees of freedom
> > --------
> >
> > Compare the mean deviance = 0.73 = 13.86/19 with the Pearson chi-square
> > estimate (as above), which equals 0.79.
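> >
> > Both figures can be read off a saved fit (the exact values depend on the
> > particular zp simulated above):
> >
> >> fit <- glm(zp ~ 1, family = quasipoisson)
> >> deviance(fit)/df.residual(fit)                           # mean deviance, 0.73
> >> sum(residuals(fit, type = "pearson")^2)/df.residual(fit) # Pearson, 0.79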
> >
> > The preference for the Pearson chi-square, as against the mean deviance,
> > is not arbitrary.
> >
> >
> > John Maindonald             email: john.maindonald at anu.edu.au
> > phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
> > Centre for Mathematics & Its Applications, Room 1194,
> > John Dedman Mathematical Sciences Building (Building 27)
> > Australian National University, Canberra ACT 0200.
> > http://www.maths.anu.edu.au/~johnm
> >
> > On 13/02/2011, at 5:16 AM, Robert A LaBudde wrote:
> >
> >> I don't believe in tests generally, so I agree with your point 2) in
> >> principle.
> >>
> >> However, most people seem to ignore overdispersion estimates (chi-square/df)
> >> if they are less than about 1.5 or so as a practical matter. But if you have
> >> a reasonable amount of data and get an effect of 10 as in the example given,
> >> minor issues as to a posteriori testing are irrelevant. Ignoring an apparent
> >> overdispersion of 10 does not seem sensible.
> >>
> >> The question as to whether deviance or Pearson chi-square is used is a minor
> >> issue, as the two are almost invariably quite close anyway.
> >>
> >> In the end, we must all agree that the model must include all important
> >> effects to be useful.
> >>
> >> At 01:04 AM 2/12/2011, John Maindonald wrote:
> >>> 1) Different types of residuals serve different purposes.
> >>>
> >>> 2) I am of the school that thinks it misguided to use the
> >>> results of a test for overdispersion to decide whether to
> >>> model it.  If there is any reason to suspect over-dispersion
> >>> (and in many/most ecological applications there is), this
> >>> is anti-conservative.  I judge this a misuse of statistical
> >>> testing.  While some do rely on the result of a test in these
> >>> circumstances, I have never seen a credible defence of
> >>> this practice.
> >>>
> >>> 3) In fitting a quasi model using glm(), McCullagh and Nelder
> >>> (which I do not have handy at the moment) argue, if I recall
> >>> correctly, for use of the Pearson chi-square estimate.  The
> >>> mean deviance is unduly susceptible to bias.
> >>>
> >>> 4) Whereas the scale factor (sqrt dispersion estimate) is incorporated
> >>> into the GLM residuals, the residuals from glmer() exclude all
> >>> random effects except that due to poisson variation.  The residuals
> >>> are what remains after accounting for all fixed and random effects,
> >>> including observation level random effects.
> >>>
> >>> 5) Your mdf divisor is too small.  Your stream, stream:rip and ID
> >>> random terms account for further 'degrees of freedom'.  Maybe
> >>> degrees of freedom are not well defined in this context?  Anyone
> >>> care to comment?  The size of this quantity cannot, in any case, be
> >>> used to indicate over-fitting or under-fitting.  The model assumes
> >>> a theoretical value of 1.  Apart from bias in the estimate, the
> >>> residuals are constrained by the model to have magnitudes that
> >>> are consistent with this theoretical value.
> >>>
> >>> 6) If you fit a non-quasi error (binomial or poisson) in a glm model,
> >>> the summary output has a column labeled "z value".  If you fit a quasi
> >>> error, the corresponding column is labeled "t value".  In the glmer
> >>> output, the label 'z value' is in my view almost always inappropriate.
> >>> To the extent that the description carries across, it is the counterpart
> >>> of the "t value" column in the glm output with the quasi error term.
> >>> (Actually, in the case where the denominator is entirely composed
> >>> from the theoretical variance, Z values that are as near as maybe
> >>> identical can almost always [always?] be derived using an
> >>> appropriate glm model with a non-quasi error term.)
> >>>
> >>> John Maindonald             email: john.maindonald at anu.edu.au
> >>> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
> >>> Centre for Mathematics & Its Applications, Room 1194,
> >>> John Dedman Mathematical Sciences Building (Building 27)
> >>> Australian National University, Canberra ACT 0200.
> >>> http://www.maths.anu.edu.au/~johnm
> >>>
> >>> On 12/02/2011, at 12:58 PM, Colin Wahl wrote:
> >>>
> >>>> In anticipation of the weekend:
> >>>> In my various readings (Crawley, Zuur, Bolker's ecological models book, the
> >>>> GLMM_TREE article and its reworked supplementary material, and R help posts),
> >>>> the discussion of overdispersion for GLMMs is made quite convoluted by
> >>>> different interpretations, different ways to test for it, and different
> >>>> solutions for dealing with it. In many cases the differences seem to stem
> >>>> from the type of data being analyzed (e.g. binomial vs. Poisson) and from
> >>>> somewhat subjective choices about which type of residuals to use for which
> >>>> models.
> >>>>
> >>>> The most consistent definition I have found is that overdispersion is
> >>>> present when the ratio of the residual scaled deviance to the residual
> >>>> degrees of freedom is > 1.
> >>>>
> >>>> Which seems simple enough.
> >>>>> modelB<-glmer(E ~ wsh*rip + (1|stream) + (1|stream:rip), data=ept,
> >>>> family=binomial(link="logit"))
> >>>>> rdev <- sum(residuals(modelB)^2)
> >>>>> mdf <- length(fixef(modelB))
> >>>>> rdf <- nrow(ept)-mdf
> >>>>> rdev/rdf #9.7 >>1
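> >>>>
> >>>> A common variant of the same check uses Pearson rather than deviance
> >>>> residuals (assuming the installed lme4 accepts type = "pearson" here):
> >>>>
> >>>>> sum(residuals(modelB, type = "pearson")^2)/rdf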
> >>>>
> >>>> So I conclude my model is overdispersed. The recent consensus solution seems
> >>>> to be to create an individual-level (observation-level) random variable and
> >>>> add it to the model.
> >>>>
> >>>> ept$obs <- 1:nrow(ept) #create individual level random variable 1:72
> >>>> modelBQ<-glmer(E ~ wsh*rip + (1|stream) + (1|stream:rip) + (1|obs),
> >>>> data=ept, family=binomial(link="logit"))
> >>>>
> >>>> I take a look at the residuals, which are now much smaller but are... just...
> >>>> too... good... for my ecological (GLMM-free) experience to be comfortable
> >>>> with. Additionally, they fit better for intermediate data, which, with
> >>>> binomial errors, is the opposite of what I would expect. Feel free to
> >>>> inspect them in the attached image (if attachments work via the mailing
> >>>> list... if not, I can send it directly to whoever is interested).
> >>>>
> >>>> Because it looks too good... I recompute rdev from the new model's residuals
> >>>> and test overdispersion again:
> >>>>
> >>>>> rdev <- sum(residuals(modelBQ)^2)
> >>>>> rdev/rdf #0.37
> >>>>
> >>>> Which is terrifically underdispersed, for which the consensus is to ignore
> >>>> it (Zuur et al. 2009).
> >>>>
> >>>> So, for my questions:
> >>>> 1. Is there anything relevant to add to/adjust in my approach thus far?
> >>>> 2. Is overdispersion an issue I should be concerned with for binomial
> >>>> errors? Most sources think so, but I did find a post from Jarrod Hadfield
> >>>> back in August where he states that overdispersion does not exist with a
> >>>> binary response variable:
> >>>> http://web.archiveorange.com/archive/v/rOz2zS8BHYFloUr9F0Ut (though in
> >>>> subsequent posts he recommends the approach I have taken, using an
> >>>> individual-level random variable).
> >>>> 3. Another approach (from Bolker's TREE_GLMM article) is to use Wald t or F
> >>>> tests instead of Z or X^2 tests to get p-values, because they "account for
> >>>> the uncertainty in the estimates of overdispersion." That seems like a nice,
> >>>> simple option, but I have not seen it come up in any other readings.
> >>>> Thoughts?
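> >>>>
> >>>> For a plain (non-mixed) fit that option would look roughly like this
> >>>> sketch with a quasi-binomial glm (succ, fail, trt and dat are
> >>>> placeholder names only):
> >>>>
> >>>>> fitq <- glm(cbind(succ, fail) ~ trt, family = quasibinomial, data = dat)
> >>>>> summary(fitq)            # Wald t statistics use the estimated dispersion
> >>>>> drop1(fitq, test = "F")  # F tests that also allow for that dispersion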
> >>>>
> >>>> Here are the glmer model outputs:
> >>>>
> >>>> ModelB
> >>>> Generalized linear mixed model fit by the Laplace approximation
> >>>> Formula: E ~ wsh * rip + (1 | stream) + (1 | stream:rip)
> >>>> Data: ept
> >>>> AIC BIC logLik deviance
> >>>> 754.3 777 -367.2    734.3
> >>>> Random effects:
> >>>> Groups     Name        Variance Std.Dev.
> >>>> stream:rip (Intercept) 0.48908  0.69934
> >>>> stream     (Intercept) 0.18187  0.42647
> >>>> Number of obs: 72, groups: stream:rip, 24; stream, 12
> >>>>
> >>>> Fixed effects:
> >>>>        Estimate Std. Error z value Pr(>|z|)
> >>>> (Intercept) -4.28529    0.50575  -8.473  < 2e-16 ***
> >>>> wshd        -2.06605    0.77357  -2.671  0.00757 **
> >>>> wshf         3.36248    0.65118   5.164 2.42e-07 ***
> >>>> wshg         3.30175    0.76962   4.290 1.79e-05 ***
> >>>> ripN         0.07063    0.61930   0.114  0.90920
> >>>> wshd:ripN    0.60510    0.94778   0.638  0.52319
> >>>> wshf:ripN   -0.80043    0.79416  -1.008  0.31350
> >>>> wshg:ripN   -2.78964    0.94336  -2.957  0.00311 **
> >>>>
> >>>> ModelBQ
> >>>>
> >>>> Generalized linear mixed model fit by the Laplace approximation
> >>>> Formula: E ~ wsh * rip + (1 | stream) + (1 | stream:rip) + (1 | obs)
> >>>> Data: ept
> >>>> AIC   BIC logLik deviance
> >>>> 284.4 309.5 -131.2    262.4
> >>>> Random effects:
> >>>> Groups     Name        Variance Std.Dev.
> >>>> obs        (Intercept) 0.30186  0.54942
> >>>> stream:rip (Intercept) 0.40229  0.63427
> >>>> stream     (Intercept) 0.12788  0.35760
> >>>> Number of obs: 72, groups: obs, 72; stream:rip, 24; stream, 12
> >>>>
> >>>> Fixed effects:
> >>>>        Estimate Std. Error z value Pr(>|z|)
> >>>> (Intercept)  -4.2906     0.4935  -8.694  < 2e-16 ***
> >>>> wshd         -2.0557     0.7601  -2.705  0.00684 **
> >>>> wshf          3.3575     0.6339   5.297 1.18e-07 ***
> >>>> wshg          3.3923     0.7486   4.531 5.86e-06 ***
> >>>> ripN          0.1425     0.6323   0.225  0.82165
> >>>> wshd:ripN     0.3708     0.9682   0.383  0.70170
> >>>> wshf:ripN    -0.8665     0.8087  -1.071  0.28400
> >>>> wshg:ripN    -3.1530     0.9601  -3.284  0.00102 **
> >>>>
> >>>>
> >>>> Cheers,
> >>>> --
> >>>> Colin Wahl
> >>>> Department of Biology
> >>>> Western Washington University
> >>>> Bellingham WA, 98225
> >>>> ph: 360-391-9881
> >>>> [attachment: ModelComp2.png]
> >>>
> >>
> >> ================================================================
> >> Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
> >> Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
> >> 824 Timberlake Drive                     Tel: 757-467-0954
> >> Virginia Beach, VA 23464-3239            Fax: 757-467-2947
> >>
> >> "Vere scire est per causas scire"
> >> ================================================================
> >>
> >
>

================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"



