[R-sig-ME] Log likelihood of a glmer() binomial model

Ben Bolker bbolker at gmail.com
Sat Apr 20 02:44:06 CEST 2019


  This seems wrong.  The GLMM log-likelihood includes an integral over
the distribution of the random effects.  If you use `nAGQ=0` (which
skips the integral) you *might* get the same answer (I think there's
no additional additive constant to be included for a Bernoulli
response?  The likelihood of a Bernoulli observation is p or 1-p, so
no normalizing constants are needed ...).  Otherwise I think you'll
have to hack around it.
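
  For what it's worth, the naive *conditional* log-likelihood of the
validation set (conditional on the predicted random effects) would look
something like the sketch below -- with "fit" and "VS" as in Rolf's
message, and assuming the 0/1 response column is called "y":

    ## each observation contributes log(p) when y == 1 and log(1-p) when
    ## y == 0; dbinom(..., size = 1) handles both cases at once
    p <- predict(fit, newdata = VS, type = "response")
    cond_ll <- sum(dbinom(VS$y, size = 1, prob = p, log = TRUE))

This is still conditional on the random effects (no integral), which is
part of why it won't match logLik() on the fitted data.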

 Here is an **inefficient** method for computing the deviance (i.e.
-2 times the log-likelihood) for the validation set:

   ## training-set parameters: random-effects (theta) first, then fixed (beta)
   coefs <- unlist(getME(fit, c("theta", "beta")))
   ## rebuild the deviance function for the validation data and evaluate it
   newdev <- update(fit, data = VS, devFunOnly = TRUE)
   newdev(coefs)

This is slow because it has to reconstruct all of the random-effects
matrices, do permutations to reorder the relevant matrices to be as
sparse as possible, etc. etc.
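
  For concreteness, here is a self-contained sketch of the whole
cross-validation step, using lme4's built-in cbpp data purely for
illustration (the 80/20 split, the intercept-only formula, and the
seed are my assumptions, not Rolf's actual setup):

    library(lme4)
    set.seed(101)
    ## 80/20 split into training and validation sets
    train_idx <- sample(nrow(cbpp), size = round(0.8 * nrow(cbpp)))
    train <- cbpp[train_idx, ]
    valid <- cbpp[-train_idx, ]
    ## fit to the training set only
    fit <- glmer(cbind(incidence, size - incidence) ~ 1 + (1 | herd),
                 data = train, family = binomial)
    ## rebuild the deviance function for the validation data and evaluate
    ## it at the training-set estimates; deviance = -2 * log-likelihood
    coefs  <- unlist(getME(fit, c("theta", "beta")))
    newdev <- update(fit, data = valid, devFunOnly = TRUE)
    cv_loglik <- -newdev(coefs) / 2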

  It would be nice if we had a more efficient form of update(), but in
this case I'm not even sure that's easy or even possible (since the
structure of the Z matrices could change a lot when we go from the
training data to the test data).

refit() efficiently rebuilds a new deviance function that uses a new
response variable, but it relies on everything else in the model
staying the same (especially the structure of the Z matrices).
Before I realized it wouldn't help, I made a quick lme4 branch that
adds a "devFunOnly" argument to the refit.merMod method:
https://github.com/lme4/lme4/tree/refit_dfonly

  If I get feedback that this would be useful for someone I may merge
it back to the master branch ...
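
  (For reference, basic refit() usage looks something like the sketch
below; simulate() is used here only to produce a new response with the
same structure as the original, which is all refit() can accept.)

    ## refit() swaps in a new response of the same length/structure as
    ## the original model frame, so it can't be pointed at the
    ## validation data, which may have a different structure
    ynew <- simulate(fit)[[1]]
    fit2 <- refit(fit, newresp = ynew)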


On Fri, Apr 19, 2019 at 7:22 PM Rolf Turner <r.turner at auckland.ac.nz> wrote:
>
>
> I am trying to implement cross-validated likelihood (see e.g. "Model
> selection for probabilistic clustering using cross-validated
> likelihood", P. Smyth, 2000, Statistics and Computing vol. 10, pp.
> 63--72) for model selection in the glmer() binomial family context.
>
> Briefly what I do is:
>
>     * divide the data into a "training set" and a "validation set"
>       (e.g. 80% and 20%)
>     * fit the model of interest to the training set *only*
>     * calculate the log-likelihood of the validation set on the
>       basis of the model fitted to the training set
>
> It is the last step about which I have some concern.  I calculate
> this log likelihood as
>
>      sum(log(predict(fit,newdata=VS,type="response")))
>
> where "fit" is the model fitted to the training set and "VS" is the
> validation set.
>
> I have the uneasy feeling that I may well be doing something stupidly
> naïve here, but I can't see anything obviously wrong with what I am doing.
>
> I have observed that, if I execute
>
>      sum(log(predict(fit,type="response")))
>
> ostensibly calculating the log likelihood of "fit" for the data set from
> which "fit" was obtained, I get a very different value from that which
> is obtained from executing logLik(fit).   This does not *necessarily*
> imply, however, that my method is wrong, since the log likelihood is
> "unique only up to an additive constant".  I cannot however see how to
> work out what this additive constant might be equal to, so that I can check.
>
> Can anyone enlighten me as to whether my log likelihood is "correct"?
>
> And if not, could someone suggest a *correct* means of calculating a
> "cross-validated" log likelihood?
>
> Thanks.
>
> cheers,
>
> Rolf Turner
>
> --
> Honorary Research Fellow
> Department of Statistics
> University of Auckland
> Phone: +64-9-373-7599 ext. 88276
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


