[R-sig-ME] p-values vs likelihood ratios

Mon Feb 21 15:45:50 CET 2011

On Mon, Feb 21, 2011 at 9:24 AM, Ben Bolker <bbolker at gmail.com> wrote:
>  I don't see why you're using AIC differences here.

My understanding is that taking the difference of the values returned
by AIC() is equivalent to computing the likelihood ratio, then
applying the AIC correction to account for the different number of
parameters in each model (then log-transforming and multiplying by 2
at the end).
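
To make that equivalence concrete, here is a minimal sketch in Python
(rather than R, purely for illustration). The log-likelihoods and
parameter counts are made-up numbers, not output from any fitted
model, and the helper aic() is my own stand-in using the standard
definition AIC = -2*logLik + 2*k.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: -2 * logLik + 2 * k."""
    return -2.0 * loglik + 2.0 * k


# Hypothetical maximized log-likelihoods and parameter counts.
loglik_reduced, k_reduced = -520.3, 3
loglik_full, k_full = -515.1, 5

# Route 1: difference of the two AIC values (reduced minus full).
delta_aic = aic(loglik_reduced, k_reduced) - aic(loglik_full, k_full)

# Route 2: start from the raw log likelihood ratio (full over reduced),
# subtract the difference in parameter counts as the complexity penalty.
log_lr = loglik_full - loglik_reduced
corrected_log_lr = log_lr - (k_full - k_reduced)

# Back on the ratio scale, this is the penalized likelihood ratio.
corrected_lr = math.exp(corrected_log_lr)

# The AIC difference is twice the penalized log likelihood ratio.
print(delta_aic, 2.0 * corrected_log_lr, corrected_lr)
```

So the AIC difference carries the same information as the penalized
likelihood ratio, just on a 2*log scale.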

My original exposure to likelihood ratios (and the AIC/BIC correction
thereof) comes from Glover & Dixon (2004,
http://www.psych.ualberta.ca/~pdixon/Home/Preprints/EasyLRms.pdf), who
describe the raw likelihood ratio as inappropriately favoring the
model with more parameters because more complex models have the
ability to fit noise more precisely than less complex models. Hence
application of some form of correction to account for the differential
complexity of the models being compared.

I wonder, however, whether cross validation might be a less
controversial approach to achieving fair comparison of two models that
differ in parameter number. That is, fit the models to a subset of the
data, then compute the likelihoods on another subset of the data. I'll
play around with this idea and report back any interesting findings...
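
A rough sketch of what I have in mind, in Python with only the
standard library. Everything here is an assumption for illustration:
the simulated data, the single 50/50 split, the two toy Gaussian
models (mean-only versus straight line), and the helper names
fit_mean, fit_line and heldout_loglik. A real comparison of mixed
models would need proper model fits and probably repeated splits.

```python
import math
import random


def fit_mean(xs, ys):
    """Reduced model y ~ N(mu, sigma^2); returns (predict, sigma)."""
    mu = sum(ys) / len(ys)
    sigma = math.sqrt(sum((y - mu) ** 2 for y in ys) / len(ys))
    return (lambda x: mu), sigma


def fit_line(xs, ys):
    """Full model y ~ N(a + b*x, sigma^2); returns (predict, sigma)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    resid_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sigma = math.sqrt(resid_ss / n)
    return (lambda x: a + b * x), sigma


def heldout_loglik(predict, sigma, xs, ys):
    """Gaussian log-likelihood of held-out data under a fitted model."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (y - predict(x)) ** 2 / (2.0 * sigma ** 2)
               for x, y in zip(xs, ys))


# Simulated data with a true slope of 0.5 (assumed for the example).
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

# Fit on the first half, evaluate likelihoods on the second half.
train_x, test_x = xs[:100], xs[100:]
train_y, test_y = ys[:100], ys[100:]

pred_reduced, sigma_reduced = fit_mean(train_x, train_y)
pred_full, sigma_full = fit_line(train_x, train_y)

# Log likelihood ratio on held-out data; no explicit parameter-count
# penalty is applied because the test data were not used for fitting.
heldout_log_lr = (heldout_loglik(pred_full, sigma_full, test_x, test_y)
                  - heldout_loglik(pred_reduced, sigma_reduced, test_x, test_y))
print(heldout_log_lr)
```

The point of the design is that any advantage the bigger model gains
by fitting noise in the training half should not carry over to the
held-out half.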

>   If one is really trying to test for "evidence of an effect" I see
> nothing wrong with a p-value stated on the basis of the null
> distribution of deviance differences between a full and a reduced model
> -- it's figuring out that distribution that is the hard part. If I were
> doing this in a Bayesian framework I would look at the credible interval
> of the parameters (although doing this for multi-parameter effects is
> harder, which is why some MCMC-based "p values" have been concocted on
> this list and elsewhere).

We'll possibly have to simply disagree on the general utility of
p-values for cumulative science (as opposed to one-off decision
making). I do, however, agree that Bayesian credible intervals have a
role in cumulative science insofar as they provide a means of relative
evaluation of models that differ not in the presence of an effect but
in the specific magnitude of the effect, as may be encountered in more
advanced/fleshed-out areas of inquiry. Otherwise, in the context of
areas where the simple existence of an effect is of theoretical
interest, computing credible intervals on effects seems like overkill
and has (from my anti-p perspective) a dangerously easy connection to
null-hypothesis significance testing.