[R-sig-ME] likelihood-ratio tests in conflict with coefficients in maximal random effect model

Levy, Roger rlevy at ucsd.edu
Fri Feb 28 21:27:02 CET 2014


As one of the co-authors of the Barr et al. 2013 paper, I would like to clarify what I believe are some misunderstandings of the central message of our paper…[see below]

On Feb 28, 2014, at 9:27 AM, Douglas Bates <bates at stat.wisc.edu> wrote:

> On Fri, Feb 28, 2014 at 10:04 AM, Emilia Ellsiepen <emilia.ellsiepen at gmail.com> wrote:
> 
>> Dear list members,
>> 
>> In analyzing a data set using lmers with maximal random effect
>> structure and subsequent likelihood-ratio tests (LRTs) following Barr
>> et al. 2013, I ran into the following problem: In some of the LRTs, it
>> turned out that the simpler model (only main effects) has a higher
>> likelihood than the more complicated model (including interaction),
>> resulting in Chisq = 0. If I simplify the models by taking out the
>> interactions in the two random effect terms, the LRT for the
>> interaction has a highly significant result.
>> 
>> Now, I figured this might either be a strong case for Barr et al., or
>> something is going wrong, as the data do suggest an interaction, and
>> the t-value for the interaction coefficient in the model summary of
>> the maximal random effect model is 1.69 - which I would have expected
>> to be a (close to) marginal effect. Also, many researchers who use
>> maximal random effect models interpret the t-values directly rather
>> than running LRTs, which might then lead to different interpretations
>> of the same data.
>> 
> 
> It shows that the more complex model has not converged to the optimum
> parameter values.  This can be because the optimizer being used is a bad
> choice (in recent versions of the lme4 package the default was a
> Nelder-Mead optimizer that can declare convergence to values that are not
> the optimal values) or it can be because the model is too complex.  We say
> that such models are over-parameterized.

First off, it is not clear that Emilia’s specific problem is being caused by over-parameterization.  Emilia, could you perhaps give more information about the nature of the dataset that you’re analyzing?  Is it a 2x2 within-subjects, within-sentence balanced design without a great deal of missing data?  In my experience with the last few pre-1.0 versions, lme4 is generally very good at converging to an optimum for these kinds of datasets with the number of observations and groups your fitted model reports.  Have you tried fitting the model with the nlminb optimizer, either by including 

optimizer="optimx",optCtrl=list(method="nlminb”)

in the list of arguments to lmerControl, or by using the last pre-1.0 version of lme4 (available as lme4.0 on R-Forge)?  Do you still get similar problems with the nlminb optimizer?  (You should definitely not get the result that the simpler model has a higher log-likelihood.)
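For concreteness, here is a minimal sketch of the refit I have in mind, assuming the optimx package is installed and using the formula and variable names from your post (quoted below); the log-likelihood comparison at the end is the sanity check that the nested model can never fit better at a true optimum:

library(lme4)
library(optimx)  # provides the nlminb wrapper used via optimizer="optimx"

m1.nlminb <- lmer(score ~ Order*Voice +
                    (Order*Voice|subject) + (Order*Voice|sentence),
                  data = rawdata, REML = FALSE,
                  control = lmerControl(optimizer = "optimx",
                                        optCtrl = list(method = "nlminb")))
## drop only the fixed-effects interaction; both random-effects terms stay
mi.nlminb <- update(m1.nlminb, . ~ . - Order:Voice)

logLik(m1.nlminb)   # should never be lower than logLik(mi.nlminb)
logLik(mi.nlminb)
anova(m1.nlminb, mi.nlminb)

If the nlminb fits give the more complex model the higher log-likelihood, as they should, that would point to a convergence problem with the earlier fits rather than to over-parameterization per se.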

> This is why the Barr et al. advice is dangerous.  In model selection there
> are two basic strategies: forward and backward.  Forward selection starts
> with a simple model and adds terms until they are no longer justified.
> Backward selection starts with the most complex model and drops terms.  It
> is well known that backward selection is problematic when you can't fit the
> most complex model.  Yet Barr et al. say unequivocally that you must use
> backward selection.  

We don’t say that you must use backward selection.  Rather, we argue against model selection altogether for determining the random-effects specification in confirmatory hypothesis testing.  We advocate “design-driven” random-effects specification for confirmatory hypothesis testing (see pages 262-263).  That is, if you are fitting a model with the goal of testing the effect of some predictor X, and X varies within levels of a random-effects grouping factor (say, in Emilia’s case, within subjects and sentences), then the by-grouping-factor random slope for X plays a critical role: it controls for what is effectively a potential confound, and it should be kept in the model for that reason.  If X varies only between levels of the grouping factor, the random intercept plays that role.
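To make this concrete, here is a schematic sketch (not code from our paper; the names y, x, subject, item, and d are hypothetical placeholders) of what a design-driven confirmatory comparison looks like for a critical predictor x that varies within subjects and within items:

library(lme4)

## both models keep the by-subject and by-item random slopes for x;
## only the critical fixed effect itself is dropped in the null model
m.alt  <- lmer(y ~ x + (1 + x | subject) + (1 + x | item),
               data = d, REML = FALSE)
m.null <- lmer(y ~ 1 + (1 + x | subject) + (1 + x | item),
               data = d, REML = FALSE)
anova(m.alt, m.null)  # LRT for the critical fixed effect of x

The point is that the likelihood-ratio test for x is then computed against a null model that still allows the effect of x to vary by subject and by item, rather than attributing that variability to the fixed effect.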

One might object that what we advocate is even worse, because keeping all the random effects in makes it even less likely that the models will converge to a global optimum.  But we draw a distinction between fixed-effects predictors of critical theoretical interest for one’s confirmatory hypothesis test and the other fixed-effects predictors in the model.  It is only for the former that we argue against model selection; for the latter, we think that more flexibility is probably reasonable.  Typically, for any particular hypothesis test, we are not talking about a huge number of parameters that must be included in order to satisfy design-driven requirements.  (We discuss this issue on page 275.)

> The result will be that many researchers, especially
> in linguistics, will encounter these problems of complex models providing
> worse fits than simpler models.

Once again, I don’t think there is enough evidence in the original post to conclude that this was necessarily a problem of over-parameterization.

> I wish that Barr et al. had provided software that is guaranteed to
> fit even the most complex model to a global optimum when they stated their
> "one size fits all" strategy.  They didn't, and those with experience in
> numerical optimization can tell you why.  It is not possible to guarantee
> convergence to an optimum in over-parameterized models.
> 
> This is why papers like Barr et al. that, based on a single simulation
> study, state absolute rules about the complex and subtle process of model
> selection, are dangerous.

Let me quote from the discussion section of our paper (page 276):

"Research is needed to evaluate the various possible strategies that one could follow when one cannot fit a maximal model. Both our theoretical analysis and simulations suggest a general rule of thumb: for whatever fixed effects are of critical interest, the corresponding random effects should be present in that analysis."

I am sorry if this statement, which I think is a reasonable summary of our overall position, seems like an absolute rule or a one-size-fits-all strategy; it was certainly not my intent that it come across as one, nor do I think it was the intent of my co-authors.  We recognize that model selection can be a complex and subtle process.  The goal of our paper was to provide readers with better guidance as to how design-driven considerations — which are extremely important for confirmatory hypothesis testing! — should be taken into account within this process.

Best

Roger


> My questions are:
>> 1) Is the LRT reported below reliable, indicating there is no
>> interaction, or does the non-expected difference in logLik suggest
>> that something went wrong?
>> 2) Is it possible that the control parameter that increases the number
>> of evaluations has something to do with it?
>> 
>> 
>> Here are some details:
>> The design is 2x2 within subjects and items; the factors are sum-coded.
>> I used REML=FALSE so that I could straightforwardly perform the LRTs
>> afterwards, and I increased the number of evaluations so that the model
>> would converge:
>> 
>> m1 <- lmer(score ~ Order*Voice +
>>              (Order*Voice|subject) + (Order*Voice|sentence),
>>            data = rawdata, REML = FALSE,
>>            control = lmerControl(optCtrl = list(maxfun = 50000)))
>> mi <- lmer(score ~ Order + Voice +
>>              (Order*Voice|subject) + (Order*Voice|sentence),
>>            data = rawdata, REML = FALSE,
>>            control = lmerControl(optCtrl = list(maxfun = 50000)))
>> 
>> anova(m1,mi)
>> 
>> Models:
>> mi: score ~ Order + Voice + (Order * Voice | subject) + (Order *
>> mi:     Voice | sentence)
>> m1: score ~ Order * Voice + (Order * Voice | subject) + (Order *
>> m1:     Voice | sentence)
>>    Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)
>> mi 24 1633.0 1742.9 -792.48   1585.0
>> m1 25 1664.1 1778.5 -807.03   1614.1     0      1          1
>> 
>> 
>> summary(m1)
>> 
>> Linear mixed model fit by maximum likelihood ['lmerMod']
>> Formula: score ~ Order * Voice + (Order * Voice | subject) + (Order *
>> Voice | sentence)
>> 
>>       AIC       BIC    logLik  deviance
>> 1664.0644 1778.5457 -807.0322 1614.0644
>> 
>> Random effects:
>>  Groups   Name          Variance  Std.Dev. Corr
>>  subject  (Intercept)   0.0387850 0.19694
>>           Order1        0.0005166 0.02273  0.65
>>           Voice1        0.0326954 0.18082  0.71 0.55
>>           Order1:Voice1 0.0266763 0.16333  0.34 0.76 0.57
>>  sentence (Intercept)   0.0683897 0.26151
>>           Order1        0.0779307 0.27916  1.00
>>           Voice1        0.0096621 0.09830  0.61 0.60
>>           Order1:Voice1 0.1013066 0.31829  0.55 0.55 0.34
>>  Residual               0.4272769 0.65366
>> Number of obs: 720, groups: subject, 36; sentence, 20
>> 
>> Fixed effects:
>>               Estimate Std. Error t value
>> (Intercept)    0.26184    0.07135   3.670
>> Order1        -0.12645    0.06711  -1.884
>> Voice1        -0.10761    0.04455  -2.415
>> Order1:Voice1  0.13562    0.08000   1.695
>> 
>> Correlation of Fixed Effects:
>>             (Intr) Order1 Voice1
>> Order1      0.776
>> Voice1      0.469  0.294
>> Order1:Voc1 0.458  0.468  0.282
>> 
>> ----------------------------------------------------
>> Emilia Ellsiepen
>> Institut für Linguistik
>> Goethe-Universität Frankfurt
>> 