[R-sig-ME] p-values vs likelihood ratios

Mon Feb 21 15:18:43 CET 2011

> -----Original Message-----
> From: r-sig-mixed-models-bounces at r-project.org
> [mailto:r-sig-mixed-models-bounces at r-project.org] On behalf of
> Ben Bolker
> Sent: Monday, 21 February 2011 14:25
> To: r-sig-mixed-models at r-project.org
> Subject: Re: [R-sig-ME] p-values vs likelihood ratios
> 
> On 11-02-21 12:09 AM, Mike Lawrence wrote:
> > Hi folks,
> > 
> > I've noticed numerous posts here that discuss the appropriateness of
> > p-values obtained by one method or another in the context of mixed
> > effects modelling. Following these discussions, I have an observation
> > (mini-rant) then a question.
> > 
> > First the observation:
> > 
> > I am not well versed in the underlying mathematical mechanics of mixed
> > effects modelling, but I would like to suggest that the apparent
> > difficulty of determining appropriate p-values itself may be a sign
> > that something is wrong with the whole idea of using mixed effects
> > modelling as a means of implementing a null-hypothesis testing
> > approach to data analysis. That is, despite the tradition-based fetish
> > for p-values generally encountered in the peer-review process, null
> > hypothesis significance testing itself is inappropriate for most cases
> > of data analysis. p-values are for politicians; they help inform
> > one-off decisions by fixing the rate at which one specific type of
> > decision error will occur (notably ignoring other types of decision
> > errors). Science on the other hand is a cumulative process that is
> > harmed by dichotomized and incomplete representation of the data as
> > null-rejected/fail-to-reject-the-null. Data analysis in science should
> > be about quantifying and comparing evidence between models of the
> > process that generated the data. My impression is that the likelihood
> > ratio (n.b. not likelihood ratio *test*) is an easily computed
> > quantity that facilitates quantitative representation of such
> > comparison of evidence.
> 
>   Yes, although I don't personally think there's anything 
> fundamentally wrong with p values when used properly (I know 
> Royall (1993) states that even in the Fisherian 'strength of 
> evidence' framework they are flawed ...)
> 
> > 
> > Now the question:
> > 
> > Am I being naive in thinking that there are no nuances to the 
> > computation of likelihood ratios and appropriateness of their 
> > interpretation in the mixed effects modelling context? To provide 
> > fodder for criticism, here are a few ways in which I imagine computing
> > then interpreting likelihood ratios:
> > 
> > Evaluation of evidence for or against a fixed effect:
> > m0 = lmer( dv ~ (1|rand) + 1 )
> > m1 = lmer( dv ~ (1|rand) + iv )
> > AIC(m0)-AIC(m1)
> > 
> > Evaluation of evidence for or against an interaction between two
> > fixed effects:
> > m0 = lmer( dv ~ (1|rand) + iv1 + iv2 )
> > m1 = lmer( dv ~ (1|rand) + iv1 + iv2 + iv1:iv2 )
> > AIC(m0)-AIC(m1)
> > 
> > Evaluation of evidence for or against a random effect:
> > m0 = lmer( dv ~ (1|rand1) + 1 )
> > m1 = lmer( dv ~ (1|rand1) + (1|rand2) + 1 )
> > AIC(m0)-AIC(m1)
> > 
> > Evaluation of evidence for or against correlation between the
> > intercept and slope of a fixed effect that is allowed to vary within
> > levels of the random effect:
> > m0 = lmer( dv ~ (1|rand) + (0+iv|rand) + iv )
> > m1 = lmer( dv ~ (1+iv|rand) + iv )
> > AIC(m0)-AIC(m1)
> > 
> > Certainly I've already encountered uncertainty in this approach in 
> > that I'm unsure whether AIC() or BIC() is more appropriate for 
> > correcting the likelihood estimates to account for the differential 
> > complexity of the models involved in these types of comparisons. I get
> > the impression that both corrections were developed in the context of
> > exploratory research where model selection involves many models
> > involving multiple usually observed variables (vs manipulated), so I
> > don't have a good understanding of how their different
> > derivations/intentions apply to this simpler context of comparing two
> > nested models to determine evidence for a specific effect of interest.
> > 
> > I would greatly appreciate any thoughts on this AIC/BIC issue, or any
> > other complexities that I've overlooked in my prescription to abandon
> > p-values in favor of the likelihood ratio (at least, for all
> > non-decision-making scientific applications of data analysis).
> 
>   I don't see why you're using AIC differences here.  If you 
> want to test hypotheses, you should use the likelihood ratio 
> (with or without ascribing a p-value to it)!  The AIC was 
> designed to estimate the expected predictive accuracy of a 
> model on out-of-sample data (as measured by the 
> Kullback-Leibler distance); the BIC is designed to 
> approximate the probability that a model is the 'true' model. 
>  AIC is a shortcut that is strongly favored by ecologists 
> (among others) because it is easy, but it does not do what 
> they are usually trying to do and what I see you trying to do 
> above, i.e. test for evidence of an effect.

I would think that if a model with an effect predicts better than the same model without that effect, this is evidence for the predictive value of the effect, conditional on our not being able to formulate the true model.
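
To make the link between the two quantities concrete: for two models fitted by maximum likelihood, the AIC difference computed in the examples above is twice the log-likelihood ratio minus twice the difference in parameter counts. Here is a minimal sketch in base R, using lm() on simulated data as a stand-in for the lmer() fits (the data, the seed and the names log_lr, dk and aic_diff are made up for illustration):

```r
# Simulated data; purely illustrative, not taken from the thread.
set.seed(1)
iv <- rnorm(50)
dv <- 1 + 0.5 * iv + rnorm(50)

m0 <- lm(dv ~ 1)    # reduced model: no effect of iv
m1 <- lm(dv ~ iv)   # full model: effect of iv

ll0 <- logLik(m0)
ll1 <- logLik(m1)

# Log-likelihood ratio (full vs reduced) and difference in parameter counts
log_lr <- as.numeric(ll1) - as.numeric(ll0)
dk <- attr(ll1, "df") - attr(ll0, "df")

# AIC = -2 * logLik + 2 * k, so the AIC difference is a penalised
# version of twice the log-likelihood ratio.
aic_diff <- AIC(m0) - AIC(m1)
stopifnot(isTRUE(all.equal(aic_diff, 2 * log_lr - 2 * dk)))

exp(log_lr)  # the likelihood ratio itself
```

The same identity holds for lmer() fits, but when the two models differ in their fixed effects they should be fitted by maximum likelihood (REML = FALSE), since REML likelihoods are not comparable across different fixed-effect structures.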

>    If one is really trying to test for "evidence of an 
> effect" I see nothing wrong with a p-value stated on the 
> basis of the null distribution of deviance differences 
> between a full and a reduced model

What may be wrong is that calculating p-values entails going back to the sample space, so the inference is not fully conditional on the current sample, whereas when the AIC is used to compare the models, the inferential statements are fully conditional on the current sample.

[respectfully snipping some interesting afterthoughts by BB]

I remember George Barnard wrote something insightful about this, something like "Probabilities are useful before an experiment is carried out, when we are surveying possible results that might be obtained, but once the sample has been obtained, there is no point in continuing to think about probabilities; at this point our statements have to be couched in terms of likelihood". It was in his 1951 paper in the JRSS, and the quote is from memory, so it is guaranteed to be inaccurate. Pratt also wrote something cool about this when commenting on a book by Lehmann.

Rubén