[R-sig-ME] nAGQ > 1 in lme4::glmer gives unexpected likelihood

Sun Apr 26 18:06:05 CEST 2020

Dimitris handled some of the practical issues, so I'm going to try to
handle some of the underlying math issues based on my very rusty
collection of measure theory .... hopefully I'll fix more "wrongs" than
I create and not incur the wrath of the math-stats deities ....

On 26/04/2020 01:23, Rolf Turner wrote:
>
> <SNIP>

> OK.  Let me display my ignorance:  I was always under the impression
> that likelihood is defined with respect to an *underlying measure*.
> Change the measure and you get a different likelihood (with the log
> likelihoods differing by a constant).
>
> To beat it to death:  Pr(X <= x) = \int_{-\infty}^x f(x) d\mu(x)
> where f(x) is the density (which can be interpreted as the likelihood)
> of X *with respect to* the measure \mu().
>
> E.g. if you have a random variable X with probability mass function
> P(x;\theta) defined, say, on the positive integers, you could make use
> of the atomic measure \mu_1() having point mass 1 at each positive
> integer, or you could make use of the atomic measure \mu_2() having
> point mass 1/2^n at each positive integer n. 

While the Kolmogorov axioms are expressed in terms of general measure
theory, we almost always skip the Lebesque integral and just use the
Riemann integral, with the associated and implied usual measure on the
real line (i.e. the one corresponding to the L2-norm). While we
certainly use other norms in stats (and the Lp norms are a great way to
think about things like regularization or even the deep relationship
between mean, median and mode), we almost always think of the
probabilities in terms of the usual measure on the real line.

On top of that, the "measures" you defined aren't the usual types of
things you see in measure theory. You seem to have blurred the line
between (or even perhaps convolved) your PDF and your measure. (I'm
looking at my copy of Kolmogorov and Fomin's Introductory Real Analysis
and really not wanting to take the time to make all of my statements
rigorous). This brings me to what I suspect is the actual problem:

The entire integral, including the associated measure, is what defines
the probability model, and it is for a given probability model that the
likelihood is unique. Indeed, that's the whole point of Fisher's work on
likelihood: you create something that is a function of the data for a
given probability model, including associated parameters. Since we can't
change the data, we change the parameters of the probability model to
maximize the likelihood. (This is being slightly infelicitous about the
fine print on P(x|θ)vs L(θ|x)and the difference between the probability
and scoring function).

The definition of (G)LMM that we use defines our probability model and
thus unique likelihood. The real fun comes in figuring out an efficient
way to evaluate the profiled log likelihood, which is where a lot of the
innovation in lme4 and MixedModels occurs -- using sparse matrix methods
on a penalized least squares problem instead of generalized least squares.

By the way, I think the problem that you're getting at with your measure
example is actually something not dissimilar to importance sampling.

One more thing to comment on below the fold ....

> <SNIP>
> Moreover, is it not the case that the likelihood that one gets from
> fitting a linear model with no random effects, using lm(), is not
> (directly) comparable with the likelihood obtained from fitting a
> linear mixed model using lmer()?  (Whence one cannot test for the
> "significance" of a random effect by comparing the two fits via a
> likelihood ratio test, as one might perhaps naïvely hope to do.)

Yes, you can compare the likelihoods between LM and LMM fits BUT the
likelihood-ratio test will be problematic because the LM fit corresponds
to an LMM fit with values at the edge of the parameter space (one or
more variance components equal to zero). Now, you could bootstrap this
test instead of assuming that the null distribution follows a
chi-squared distribution with an appropriate number of degrees of
freedom, then everything would be fine. I even have some example code
for this in Julia that I could potentially clean up and share. AIC/BIC
don't suffer from the same issues as the LRT (because they don't assume
a null distribution), but have their own difficulties and I don't want
to spend too much time discussing model selection here as enough ink and
arXiv space has been spilled over it.

I hope that helps and that I have fixed more misunderstandings than I
created.

Phillip

>
> Please enlighten me as to the ways in which my thinking is confused.
>
> cheers,
>
> Rolf
>

	[[alternative HTML version deleted]]