[R-sig-ME] glmer overdispersion correction, family = binomial
John Maindonald
john.maindonald at anu.edu.au
Fri Mar 4 06:50:05 CET 2011
My comments related to the model with the observation level random
effects. It is the rdf divisor for this model that is too small. (The rdf will
also be somewhat small for a random effects model that does
not have observation level random effects, but the bias will be much
smaller.) It is the residuals in the model with the observation level
random effects that
"are constrained by the model to have magnitudes that are consistent
with this theoretical variance that equals 1."
All variance that is not otherwise accounted for is gathered up by the
observation level random effects. You do not have evidence that
"the fix was overkill".
The dispersion estimate for the observation level random effects
model will certainly not tell you anything about possible overdispersion.
I suppose that if you had a model that was under-dispersed before
adding the observation level random effects, you might (if you could
work out some sensible way to calculate rdf) get a dispersion estimate
that was less than 1. But, in that case, you would not want to fit the
observation level random effects in the first place.
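
Concretely, by an observation level random effects model I mean a sketch
like the one below, where 'obs' is a new factor with one level per row of
the data (the name 'obs' is illustrative only):

ept$obs <- factor(seq_len(nrow(ept)))   # one level per observation
modelOLRE <- glmer(E ~ wsh*rip + (1|stream) + (1|stream:rip) + (1|obs),
                   data=ept, family=binomial(link="logit"))

Each row then gets its own random intercept, and that term gathers up
whatever extra-binomial variation is present.
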
John Maindonald email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm
On 04/03/2011, at 4:18 PM, Colin Wahl wrote:
> Good day,
>
> A few weeks back I sent out a message on the mixed effects list in an attempt
> to clarify an acceptable approach to testing hypotheses when overdispersion is
> present. I had tried adding an observation level random effect to the model,
> as per recommendations in the archives; however, this fix was overkill and
> overfit the model (with a dispersion parameter of 0.3).
>
> The following message is rather long and full of questions relevant to my
> understanding. This overdispersion issue has become quite the impasse for me
> (one among many). I'm grateful for any and all help everyone has provided
> thus far. The existence of a helpful online community such as this is why I
> decided to learn R in the first place. Thank you.
>
> Background:
> I am using a GLMM with binomial errors. My dispersion factor is 9.7,
> calculated using the following method:
> rdev <- sum(residuals(modelB, "pearson")^2)  # Pearson chi-square statistic
> mdf <- length(fixef(modelB))                 # counts fixed effects only
> rdf <- nrow(ept) - mdf                       # residual df
> rdev/rdf # = 9.7
>
> My model is:
> modelB<-glmer(E ~ wsh*rip + (1|stream) + (1|stream:rip), data=ept,
> family=binomial(link="logit"))
>
> The structure of my data is copied below the body.
>
>
>
> Considering the dispersion parameter:
>
> John Maindonald suggested in his Feb 12th message that the mdf is too small
> in this calculation because the random effects account for some unknown
> number (somewhere between 1 and 12) of df.
>
> If mdf is too small, then the dispersion factor will be even larger. Is this
> value (9.7) accurate or acceptably so?
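>
> As a rough sensitivity check (assuming, per John's suggestion, that the
> random effects absorb somewhere between 1 and 12 additional df), the
> estimate only moves upward:
>
> rdev / (nrow(ept) - mdf - c(1, 12))   # dispersion with 1 or 12 extra df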
>
> J.Maindonald also said: "The size of this quantity cannot, in any case, be
> used to indicate over-fitting or under-fitting. The model assumes a
> theoretical value of 1. Apart from bias in the estimate, the residuals are
> constrained by the model to have magnitudes that are consistent with this
> theoretical value."
>
> Does this mean the dispersion parameter I've calculated (9.7) cannot be used
> to determine if the model is overdispersed?
>
>
>
> Concerning previously suggested fixes:
>
> The responses to my Feb 11th message were somewhat helpful, but perhaps a
> bit cryptic for my still-novice R modeling skills. I thank all those who
> offered their help nonetheless. I had so many questions (and was a bit
> intimidated) that I dug into books before asking everyone on the list. I
> apologize for not responding to last month's messages with more urgency.
>
> Robert LaBudde (Feb 11) suggested three options to deal with overdispersion:
> 1. Easy option: multiply the confidence limits by 3.1 (sqrt(9.7)).
> 2. Better option: fit using a quasibinomial, negative binomial, or
> beta-binomial.
> 3. Best option: examine the data, find where clustering is occurring, and
> find a causal explanation for it; then adjust the model to account for
> extra-binomial variation.
>
> 1. I understand that profiling and MCMC can calculate CIs, but is there a
> way to get CIs out of glmer? (My rough attempt is sketched after this list.)
> 2. glmer does not support quasi families. I understand that they were
> removed because they were not reliable. Does this mean they will also not be
> reliable in other packages? I am reluctant to step away from glmer, which
> I have become comfortable with. Comfort is not a particularly good reason to
> avoid exploring other options, so if anyone has any suggestions for methods
> where I can use quasi distributions properly, I would love to hear them.
> 3. How do I determine where clustering occurs? How is overdispersion
> caused by clustering?
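>
> For option 1, my naive reading is that I could scale the Wald intervals
> from glmer by sqrt(9.7) myself. This is only a sketch, and using vcov()
> to pull out the standard errors is my assumption:
>
> phi <- 9.7                                   # dispersion estimate from above
> est <- fixef(modelB)
> se  <- sqrt(diag(as.matrix(vcov(modelB))))   # Wald standard errors (assumed)
> cbind(lower = est - 1.96*se*sqrt(phi),       # intervals inflated by sqrt(phi)
>       upper = est + 1.96*se*sqrt(phi))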
>
>
> Concerning other potential solutions:
>
> Wald tests:
> When overdispersion is present, Bolker et al.'s 2009 TREE article suggests
> using Wald t or F tests for hypothesis testing on fixed factors and LR
> tests for random factors. Another possible approach suggested in the
> article is to use PQL (because my numbers of failures and successes are much
> larger than 5 for each treatment combination) with Wald's t or F (I sketch
> an attempt after the questions below). In subsequent posts Bolker suggested
> that the Laplace approximation is probably just as good as, if not better
> than, PQL (and ?glmer suggests that GHQ is better still). I also understand
> from this post that the Wald t statistic is computed the same way as the z
> statistic; the difference is which reference distribution is used (normal
> for z, or t). Is this also true for Wald's F?
> Is it possible to specify which distribution for glmer to use?
> If not how do I determine which df glmer is using for its initial
> calculation to do it manually?
> How does glmer calculate df?
> How would I approach this using the F distribution?
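>
> If PQL is acceptable here, my (untested) understanding is that
> MASS::glmmPQL accepts a quasibinomial family and reports Wald t tests
> with denominator df, along these lines:
>
> library(MASS)                       # provides glmmPQL (iterated lme fits)
> modelQ <- glmmPQL(E ~ wsh*rip,
>                   random = ~ 1 | stream/rip,  # mimics (1|stream) + (1|stream:rip)
>                   family = quasibinomial(link="logit"), data = ept)
> summary(modelQ)                     # Wald t table; dispersion is estimated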
>
> Parametric bootstrapping and MCMC:
> Recently on this list Bolker suggested that people use parametric
> bootstrapping or something MCMC-based if p values really matter. As my
> questions have been formulated to test hypotheses (Are there differences in
> EPT abundance between watershed types? Between riparian types?), it's my
> take that hypothesis testing does matter.
>
> Are parametric bootstrapping and MCMC-based methods appropriate or necessary
> in my case? I have no prior knowledge of Bayesian statistics or
> bootstrapping, so I am hesitant, but if that is the path I must take through
> the jungle, so be it.
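>
> In case it helps anyone correct me, here is my naive sketch of a
> parametric bootstrap for a fixed effect. modelB0 is a hypothetical fit
> with the term of interest dropped, and I am assuming that simulate() and
> refit() work for these fits in my version of lme4:
>
> LRT.obs <- as.numeric(2*(logLik(modelB) - logLik(modelB0)))
> LRT.sim <- replicate(199, {
>     y <- simulate(modelB0)[[1]]     # response generated under the null
>     as.numeric(2*(logLik(refit(modelB, y)) - logLik(refit(modelB0, y))))
> })
> (1 + sum(LRT.sim >= LRT.obs)) / (1 + length(LRT.sim))   # bootstrap p value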
>
> Since my last message I have, in my search for a "When there is
> overdispersion you can correct for it by..." type of source, found a nice,
> quite tangible (for me) discussion of how to correct for overdispersion
> when assessing the significance of parameters, from Zuur et al. 2007
> (*Ecological Data*). I should preface the following by clarifying that this
> is from a chapter on GLMs, not GLMMs.
>
> The correction uses the difference in deviance between two nested models:
> Model 1, the reduced model, with deviance D1 and q parameters, and Model 2,
> the full model, with deviance D2 and p parameters. The difference in
> deviance is divided by the dispersion parameter (ϕ = 9.7) times the
> difference in parameter counts.
>
> The resulting statistic, F = (D1 - D2) / (ϕ(p - q)), follows an F
> distribution with p - q and n - p df. The text seems to suggest that this
> is what using a quasibinomial family does by incorporating ϕ, but I am not
> certain.
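>
> In code, my reading of that test is the following (model1 and model2 are
> hypothetical nested fits, and the parameter counts p and q are exactly
> the part I do not know how to define for a GLMM):
>
> phi <- 9.7
> n   <- nrow(ept)                  # 72 observations
> D1  <- deviance(model1); q <- 8   # reduced model; parameter count assumed
> D2  <- deviance(model2); p <- 10  # full model; parameter count assumed
> F.stat <- (D1 - D2) / (phi * (p - q))
> pf(F.stat, df1 = p - q, df2 = n - p, lower.tail = FALSE)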
>
> So I have done this, but likely not correctly. I expect my count of the
> number of parameters in the different models is inaccurate, especially when
> the difference in parameter count is due to the removal of a random factor.
> Considering this approach *was* for GLMs, not GLMMs, perhaps it won't work,
> perhaps for the above reason. The chapter certainly does not expand on
> parameter counting at all, especially in the case of a random factor.
> Am I busy sweeping leaves on a windy day? Thoughts?
>
> Primary questions:
> 1. Is GHQ more accurate than the Laplace approximation? Will using the Wald
> t, as suggested in the TREE article, suffice for determining significance?
> If so, should there be a caveat that this is an approximation that is likely
> anti-conservative?
>
> 2. Will the LR test be of use to me for random factors? For fixed factors?
> How is the LRT affected by overdispersion?
>
> 3. Does the deviance test I described from Zuur et al. 2007 hold some
> promise? How should I go about determining (or approximating) the
> number of parameters for fixed vs. random factors? Is it possible?
>
> 4. Should I try an alternate binomial distribution (quasibinomial, negative
> binomial, beta-binomial)? Which package should I use?
>
> 5. Any other suggestions?
>
>
>> str(ept[,2:5])
> 'data.frame': 72 obs. of 4 variables:
> $ wsh : Factor w/ 4 levels "c","d","f","g": 1 1 1 1 1 1 1 1 1 1 ...
> $ stream: Factor w/ 12 levels "BA","D1","D2",..: 4 4 4 6 6 6 10 10 10 4 ...
> $ rip : Factor w/ 2 levels "F","N": 1 1 1 1 1 1 1 1 1 2 ...
> $ ept : num 0.0977 0.0242 0.037 0.0213 0.0248 ...
>
> --
> Colin Wahl
> MSc student
> Department of Biology
> Western Washington University
> Bellingham WA, 98225
> ph: 360-391-9881
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models