[R-sig-ME] p-values vs likelihood ratios

Ben Bolker bbolker at gmail.com
Tue Feb 22 04:10:11 CET 2011


On 11-02-21 09:18 AM, Rubén Roa wrote:
> 
>> -----Original Message-----
>> From: r-sig-mixed-models-bounces at r-project.org
>> [mailto:r-sig-mixed-models-bounces at r-project.org] On behalf of Ben Bolker
>> Sent: Monday, 21 February 2011 14:25
>> To: r-sig-mixed-models at r-project.org
>> Subject: Re: [R-sig-ME] p-values vs likelihood ratios
>> 
>> 
>> On 11-02-21 12:09 AM, Mike Lawrence wrote:
>>> Hi folks,
>>> 
>>> I've noticed numerous posts here that discuss the appropriateness of
>>> p-values obtained by one method or another in the context of mixed
>>> effects modelling. Following these discussions, I have an observation
>>> (mini-rant) then a question.
>>> 
>>> First the observation:
>>> 
>>> I am not well versed in the underlying mathematical mechanics of
>>> mixed effects modelling, but I would like to suggest that the
>>> apparent difficulty of determining appropriate p-values may itself be
>>> a sign that something is wrong with the whole idea of using mixed
>>> effects modelling as a means of implementing a null-hypothesis
>>> testing approach to data analysis. That is, despite the
>>> tradition-based fetish for p-values generally encountered in the
>>> peer-review process, null hypothesis significance testing itself is
>>> inappropriate for most cases of data analysis. p-values are for
>>> politicians; they help inform one-off decisions by fixing the rate at
>>> which one specific type of decision error will occur (notably
>>> ignoring other types of decision errors). Science, on the other hand,
>>> is a cumulative process that is harmed by a dichotomized and
>>> incomplete representation of the data as
>>> null-rejected/fail-to-reject-the-null. Data analysis in science
>>> should be about quantifying and comparing evidence between models of
>>> the process that generated the data. My impression is that the
>>> likelihood ratio (n.b. not likelihood ratio *test*) is an easily
>>> computed quantity that facilitates quantitative representation of
>>> such a comparison of evidence.
>> 
>> Yes, although I don't personally think there's anything 
>> fundamentally wrong with p values when used properly (I know Royall
>> (1993) states that even in the Fisherian 'strength of evidence'
>> framework they are flawed ...)
>> 
>>> 
>>> Now the question:
>>> 
>>> Am I being naive in thinking that there are no nuances to the
>>> computation of likelihood ratios and the appropriateness of their
>>> interpretation in the mixed effects modelling context? To provide
>>> fodder for criticism, here are a few ways in which I imagine
>>> computing and then interpreting likelihood ratios:
>>> 
>>> Evaluation of evidence for or against a fixed effect:
>>>   m0 = lmer( dv ~ (1|rand) + 1 )
>>>   m1 = lmer( dv ~ (1|rand) + iv )
>>>   AIC(m0) - AIC(m1)
>>> 
>>> Evaluation of evidence for or against an interaction between two
>>> fixed effects:
>>>   m0 = lmer( dv ~ (1|rand) + iv1 + iv2 )
>>>   m1 = lmer( dv ~ (1|rand) + iv1 + iv2 + iv1:iv2 )
>>>   AIC(m0) - AIC(m1)
>>> 
>>> Evaluation of evidence for or against a random effect:
>>>   m0 = lmer( dv ~ (1|rand1) + 1 )
>>>   m1 = lmer( dv ~ (1|rand1) + (1|rand2) + 1 )
>>>   AIC(m0) - AIC(m1)
>>> 
>>> Evaluation of evidence for or against correlation between the
>>> intercept and slope of a fixed effect that is allowed to vary within
>>> levels of the random effect:
>>>   m0 = lmer( dv ~ (1+iv|rand) + iv )
>>>   m1 = lmer( dv ~ (1|rand) + (0+iv|rand) + iv )
>>>   AIC(m0) - AIC(m1)
>>> 
>>> Certainly I've already encountered uncertainty in this approach, in
>>> that I'm unsure whether AIC() or BIC() is more appropriate for
>>> correcting the likelihood estimates to account for the differential
>>> complexity of the models involved in these types of comparisons. I
>>> get the impression that both corrections were developed in the
>>> context of exploratory research, where model selection involves many
>>> models with multiple, usually observed (vs. manipulated) variables,
>>> so I don't have a good understanding of how their different
>>> derivations/intentions apply to this simpler context of comparing
>>> two nested models to determine evidence for a specific effect of
>>> interest.
>>> 
>>> I would greatly appreciate any thoughts on this AIC/BIC issue, or any
>>> other complexities that I've overlooked in my proposal to abandon
>>> p-values in favor of the likelihood ratio (at least, for all
>>> non-decision-making scientific applications of data analysis).
>> 
>> I don't see why you're using AIC differences here.  If you want to
>> test hypotheses, you should use the likelihood ratio (with or
>> without ascribing a p-value to it)!  The AIC was designed to
>> estimate the expected predictive accuracy of a model on
>> out-of-sample data (as measured by the Kullback-Leibler distance);
>> the BIC is designed to approximate the probability that a model is
>> the 'true' model. AIC is a shortcut that is strongly favored by
>> ecologists (among others) because it is easy, but it does not do
>> what they are usually trying to do and what I see you trying to do
>>  above, i.e. test for evidence of an effect.
> 
> I would think that improved predictive ability in a model with an
> effect, compared with that of a model without the effect, is evidence
> for the predictive ability of that effect, conditional on not being
> able to formulate the true model.

  We all have different philosophies on this; I think that all the
people commenting here have *reasonable* approaches.  My own feeling is
that testing hypotheses is different from trying to find the best
predictive model.  In relatively extreme cases (where the additional
complexity of the full model is either completely useless or very
valuable), all the different ways of testing models (AIC, BIC, Bayes
factor, p-value, likelihood ratio ...) agree.  In borderline cases they
don't, and then I think it's useful to have the clearest possible
statement of what one is trying to do.
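
  For concreteness, here is a minimal sketch of the fixed-effect
comparison being discussed, using simulated data with hypothetical
names (dv, iv, rand) and assuming a reasonably recent version of the
lme4 package.  It computes the likelihood ratio Mike asks about, the
corresponding AIC difference, and the likelihood-ratio-test p-value
that anova() reports:

library(lme4)

## simulated example data (hypothetical): 20 groups of 10 observations,
## a modest fixed effect of iv, and a random intercept per group
set.seed(1)
d <- data.frame(rand = factor(rep(1:20, each = 10)),
                iv   = rnorm(200))
d$dv <- 0.3 * d$iv + rnorm(20)[as.integer(d$rand)] + rnorm(200)

## reduced and full models, fitted by ML (REML = FALSE), which is what
## you want when comparing models that differ in their fixed effects
m0 <- lmer(dv ~ 1  + (1 | rand), data = d, REML = FALSE)
m1 <- lmer(dv ~ iv + (1 | rand), data = d, REML = FALSE)

## the likelihood ratio itself ...
exp(as.numeric(logLik(m1)) - as.numeric(logLik(m0)))

## ... the AIC difference used in the examples above ...
AIC(m0) - AIC(m1)

## ... and the likelihood ratio *test*: the deviance difference referred
## to a chi-squared distribution, giving a p-value
anova(m0, m1)

In clear-cut cases all three numbers tell the same story; it is the
borderline cases where the distinction drawn above matters.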


> 
>> If one is really trying to test for "evidence of an effect" I see
>> nothing wrong with a p-value stated on the basis of the null
>> distribution of deviance differences between a full and a reduced
>> model
> 
> Probably the wrong part is that calculating p-values entails going
> back to the sample space, making the inference not fully conditional
> on the current sample, whereas when using the AIC to compare the
> models, the inferential statements are fully conditional on the
> current sample.
> 
> [respectfully snipping some interesting afterthoughts by BB]
> 
> I remember George Barnard wrote something insightful about this,
> something like "Probabilities are useful before an experiment is
> carried out, when we are surveying possible results that might be
> obtained, but once the sample has been obtained, there is no point in
> continuing to think about probabilities; at this point our statements
> have to be couched in terms of likelihood". It was in his 1951 paper
> in the JRSS, and the quote is from memory, so it is guaranteed to be
> inaccurate. Pratt also wrote something cool about this when commenting
> on a book by Lehmann.
> 
> Rubén
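
  Relating to the quoted point above about a p-value based on the null
distribution of deviance differences: here is a minimal sketch of one
way to obtain that null distribution by parametric bootstrap,
simulating new responses from the reduced model and refitting both
models.  It reuses the hypothetical m0/m1 fits from the earlier sketch
and again assumes a recent lme4; the result can be compared with the
asymptotic chi-squared p-value from anova():

## observed deviance difference between the reduced and full fits
obs <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))

## simulate dv from the reduced model, refit both models, and record the
## deviance difference expected when the reduced model is true
nsim <- 500
null_dist <- replicate(nsim, {
  ysim <- simulate(m0)[[1]]
  dev0 <- -2 * as.numeric(logLik(refit(m0, ysim)))
  dev1 <- -2 * as.numeric(logLik(refit(m1, ysim)))
  dev0 - dev1
})

## bootstrap p-value: the proportion of simulated deviance differences
## at least as large as the observed one
mean(null_dist >= obs)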
> 




More information about the R-sig-mixed-models mailing list