[R-sig-ME] p-values vs likelihood ratios

Mon Feb 21 23:54:24 CET 2011

Ben wrote:
"Yes, although I don't personally think there's anything fundamentally
wrong with p values when used properly (I know Royall (1993) states that
even in the Fisherian 'strength of evidence' framework they are flawed ...)"

If there is an informative prior, p-values are misleading.  Hence the
so-called prosecutor's fallacy.  Sure, it was unlikely that this DNA would
be found on this suspect, but . . .  See

http://en.wikipedia.org/wiki/Prosecutor's_fallacy
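
To make the arithmetic behind the fallacy concrete, here is a minimal sketch
in R. Every number in it (the match probability, the size of the population
of possible sources) is hypothetical and chosen only for illustration; none
of it comes from a real case.

```r
# Hypothetical inputs, for illustration only.
p_match_given_innocent <- 1e-6   # chance that an innocent person's DNA matches
p_match_given_guilty   <- 1      # assume the true source always matches
population             <- 5e6    # assumed number of possible sources
prior_guilty           <- 1 / population

# Bayes' rule: probability of guilt given a match.
posterior_guilty <- (p_match_given_guilty * prior_guilty) /
  (p_match_given_guilty * prior_guilty +
   p_match_given_innocent * (1 - prior_guilty))

print(posterior_guilty)
```

With these made-up inputs the probability of guilt given a match is roughly
1 in 6, even though a match for an innocent person is a one-in-a-million
event. The small tail probability and the posterior probability answer
different questions, and the prior is what separates them.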

If p-values are useful at all (and I think they are), the range of contexts
in which they are useful is severely constrained.

In multi-level models, the denominators of F-like and t-like tests are
commonly linear combinations of variance estimates.  The Behrens-Fisher
phenomenon then arises.  p-values are not well defined, though it may
be possible to give intervals within which they must lie.  Maybe
"fundamentally wrong" is too strong.  Certainly, for multi-level models,
the Behrens-Fisher phenomenon pokes a hole in p-value based
reasoning. 
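
The simplest instance is the two-sample comparison with unequal variances,
sketched below in base R. The data are simulated and the object names are
made up for illustration. The denominator of the t-like statistic is a
linear combination of two variance estimates, and the Welch-Satterthwaite
approximation to its degrees of freedom always falls between
min(n1, n2) - 1 and n1 + n2 - 2, which gives one kind of bracket for the
approximate p-value.

```r
set.seed(1)  # simulated data, for illustration only
x <- rnorm(8,  mean = 0, sd = 1)
y <- rnorm(20, mean = 1, sd = 3)

# Denominator: a linear combination of two variance estimates.
v1 <- var(x) / length(x)
v2 <- var(y) / length(y)
t_stat <- (mean(x) - mean(y)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximate degrees of freedom.
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))

# Bounds on the approximate degrees of freedom.
df_low  <- min(length(x), length(y)) - 1
df_high <- length(x) + length(y) - 2

# Two-sided p-value for the same statistic at a given df.
p_at <- function(df) 2 * pt(-abs(t_stat), df)

print(c(low_df = p_at(df_low), welch = p_at(df_welch), high_df = p_at(df_high)))
```

The middle value is the one that t.test(x, y) reports by default. The two
outer values bracket what the approximation can give for this statistic;
they do not settle what the "right" p-value is.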

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm

On 22/02/2011, at 12:24 AM, Ben Bolker wrote:

> On 11-02-21 12:09 AM, Mike Lawrence wrote:
>> Hi folks,
>> 
>> I've noticed numerous posts here that discuss the appropriateness of
>> p-values obtained by one method or another in the context of mixed
>> effects modelling. Following these discussions, I have an observation
>> (mini-rant) then a question.
>> 
>> First the observation:
>> 
>> I am not well versed in the underlying mathematical mechanics of mixed
>> effects modelling, but I would like to suggest that the apparent
>> difficulty of determining appropriate p-values itself may be a sign
>> that something is wrong with the whole idea of using mixed effects
>> modelling as a means of implementing a null-hypothesis testing
>> approach to data analysis. That is, despite the tradition-based fetish
>> for p-values generally encountered in the peer-review process, null
>> hypothesis significance testing itself is inappropriate for most cases
>> of data analysis. p-values are for politicians; they help inform
>> one-off decisions by fixing the rate at which one specific type of
>> decision error will occur (notably ignoring other types of decision
>> errors). Science on the other hand is a cumulative process that is
>> harmed by dichotomized and incomplete representation of the data as
>> null-rejected/fail-to-reject-the-null. Data analysis in science should
>> be about quantifying and comparing evidence between models of the
>> process that generated the data. My impression is that the likelihood
>> ratio (n.b. not likelihood ratio *test*) is an easily computed
>> quantity that facilitates quantitative representation of such
>> comparison of evidence.
> 
>  Yes, although I don't personally think there's anything fundamentally
> wrong with p values when used properly (I know Royall (1993) states that
> even in the Fisherian 'strength of evidence' framework they are flawed ...)
> 
>> 
>> Now the question:
>> 
>> Am I being naive in thinking that there are no nuances to the
>> computation of likelihood ratios and appropriateness of their
>> interpretation in the mixed effects modelling context? To provide
>> fodder for criticism, here are a few ways in which I imagine computing
>> then interpreting likelihood ratios:
>> 
>> Evaluation of evidence for or against a fixed effect:
>> m0 = lmer( dv ~ (1|rand) + 1 )
>> m1 = lmer( dv ~ (1|rand) + iv )
>> AIC(m0)-AIC(m1)
>> 
>> Evaluation of evidence for or against an interaction between two fixed effects:
>> m0 = lmer( dv ~ (1|rand) + iv1 + iv2 )
>> m1 = lmer( dv ~ (1|rand) + iv1 + iv2 + iv1:iv2 )
>> AIC(m0)-AIC(m1)
>> 
>> Evaluation of evidence for or against a random effect:
>> m0 = lmer( dv ~ (1|rand1) + 1 )
>> m1 = lmer( dv ~ (1|rand1) + (1|rand2) + 1 )
>> AIC(m0)-AIC(m1)
>> 
>> Evaluation of evidence for or against correlation between the
>> intercept and slope of a fixed effect that is allowed to vary within
>> levels of the random effect:
>> m0 = lmer( dv ~ (1|rand) + (0+iv|rand) + iv )
>> m1 = lmer( dv ~ (1+iv|rand) + iv )
>> AIC(m0)-AIC(m1)
>> 
>> Certainly I've already encountered uncertainty in this approach in
>> that I'm unsure whether AIC() or BIC() is more appropriate for
>> correcting the likelihood estimates to account for the differential
>> complexity of the models involved in these types of comparisons. I get
>> the impression that both corrections were developed in the context of
>> exploratory research where model selection involves many models
>> involving multiple usually observed variables (vs manipulated), so I
>> don't have a good understanding of how their different
>> derivations/intentions apply to this simpler context of comparing two
>> nested models to determine evidence for a specific effect of interest.
>> 
>> I would greatly appreciate any thoughts on this AIC/BIC issue, or any
>> other complexities that I've overlooked in my prescription to abandon
>> p-values in favor of the likelihood ratio (at least, for all
>> non-decision-making scientific applications of data analysis).
> 
>  I don't see why you're using AIC differences here.  If you want to
> test hypotheses, you should use the likelihood ratio (with or without
> ascribing a p-value to it)!  The AIC was designed to estimate the
> expected predictive accuracy of a model on out-of-sample data (as
> measured by the Kullback-Leibler distance); the BIC is designed to
> approximate the probability that a model is the 'true' model.  AIC is a
> shortcut that is strongly favored by ecologists (among others) because
> it is easy, but it does not do what they are usually trying to do and
> what I see you trying to do above, i.e. test for evidence of an effect.
>   If one is really trying to test for "evidence of an effect" I see
> nothing wrong with a p-value stated on the basis of the null
> distribution of deviance differences between a full and a reduced model
> -- it's figuring out that distribution that is the hard part. If I were
> doing this in a Bayesian framework I would look at the credible interval
> of the parameters (although doing this for multi-parameter effects is
> harder, which is why some MCMC-based "p values" have been concocted on
> this list and elsewhere).
> 
>  Ben Bolker
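
The relationship between the AIC difference and the likelihood ratio can be
sketched in base R. As an assumption made purely so the example runs without
extra packages, lm() and the built-in cars data stand in for the mixed
models discussed above; the same logLik() and AIC() calls apply to lmer()
fits, where models that differ in their fixed effects would normally be
fitted by maximum likelihood rather than REML before being compared.

```r
# Stand-in models: lm() on built-in data instead of lmer() (an assumption).
m0 <- lm(dist ~ 1, data = cars)       # reduced model
m1 <- lm(dist ~ speed, data = cars)   # full model

ll0 <- logLik(m0)
ll1 <- logLik(m1)

log_lr   <- as.numeric(ll1) - as.numeric(ll0)  # log likelihood ratio, m1 vs m0
extra_k  <- attr(ll1, "df") - attr(ll0, "df")  # extra parameters in m1
aic_diff <- AIC(m0) - AIC(m1)

# The AIC difference is the deviance difference minus 2 per extra parameter.
print(c(deviance_diff = 2 * log_lr, aic_diff = aic_diff))

# Asymptotic chi-square p-value for the deviance difference. For mixed
# models, whether this reference distribution is adequate is the hard part.
p_chisq <- pchisq(2 * log_lr, df = extra_k, lower.tail = FALSE)
print(p_chisq)
```

So for two nested models the AIC difference carries the same information as
the likelihood ratio, shifted by a fixed penalty of 2 per extra parameter;
the open question is which null distribution to hold the deviance difference
against.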