[R-sig-ME] lmer() producing wildly inaccurate values for some GLM models

Fri Oct 19 10:44:15 CEST 2007

Hello,

The data I am trying to model are positive and right-skewed (they're
measurements of fixation times recorded from an eye-tracker), so I've
been trying to use lmer's GLM fitting capability to model them.  In
particular, I've been investigating whether it makes most sense to
use something like a gamma or inverse gaussian distribution with an
identity link (since theory and examination of the data both suggest
that the mean of the response is in fact linear in my predictors of
interest).  I was hoping to use log likelihood to work out which
model made the most sense (e.g., using the approach of Vuong 1989),
but... something seemed funny about the numbers I was getting for the
likelihood, and also the random effects, so I did some testing.

lmer's log-likelihood calculations for these models appear to be
completely wrong.  In particular, the log likelihood that it
calculates seems to be many orders of magnitude smaller than it could
possibly be, and orders of magnitude smaller when fitting an inverse
gaussian distribution than when fitting a gamma distribution, even if
the data in fact are closer to a gamma distribution.

To demonstrate the problem, I generated some toy data where there the
true random effects are null.  This isn't necessary to cause the
problem (i.e., lmer isn't just breaking because the model is
over-specified); I did things this way because it makes the models
fit by lmer and glm more directly comparable.

This code takes 10,000 samples from the true model for a Gamma GLM
with E(y) = 200 + 3*x, identity link, and shape parameter 5, then
attempts to fit it in various ways:

---------------------------->8----------------------------

set.seed(0)
samples <- 10000
true.shape <- 5
x <- rgamma(samples, 2, scale=2)
i <- 200
b <- 3
true.scale <- (i + b*x)/true.shape
y <- rgamma(samples, true.shape, scale=true.scale)
subj <- factor(c(rep("a", samples/2), rep("b", samples/2)))
gaussian.lm <- lm(y ~ x)
gaussian.lmer <- lmer(y ~ x + (x | subj))
gamma.glm <- glm(y ~ x, family=Gamma(link="identity"))
gamma.lmer <- lmer(y ~ x + (x | subj), family=Gamma(link="identity"))
ig.glm <- glm(y ~ x, family=inverse.gaussian(link="identity"))
ig.lmer <- lmer(y ~ x + (x | subj),
                family=inverse.gaussian(link="identity"))
nums.glm <- function(model) c(coef(model), logLik=logLik(model))
nums.lmer <- function(model) c(fixef(model), logLik=logLik(model))
rbind(gaussian.lm=nums.glm(gaussian.lm), gaussian.lmer=nums.lmer(gaussian.lmer),
      gamma.glm=nums.glm(gamma.glm), gamma.lmer=nums.lmer(gamma.lmer),
      ig.glm=nums.glm(ig.glm), ig.lmer=nums.lmer(ig.lmer))

----------------------------8<----------------------------

Output:

---------------------------->8----------------------------

              (Intercept)        x        logLik
gaussian.lm      203.0480 2.686375 -59761.426766
gaussian.lmer    203.0279 2.693471 -59759.572897
gamma.glm        202.8143 2.745581 -59054.739626
gamma.lmer       202.8143 2.745464  -1025.060862
ig.glm           202.7024 2.774986 -59309.711506
ig.lmer          202.7024 2.774986     -5.788107

----------------------------8<----------------------------

So, as you can see:
  -- Every fitting function does well at recovering the correct
     regression coefficients (and furthermore there is excellent
     agreement between (g)lm and lmer -- lmer correctly identifies
     the lack of random effects in all cases).
  -- The (g)lm log-likelihoods are all plausible (also I checked the
     one for gamma by hand, and it is correct), and they correctly
     identify that this is a gamma rather than inverse-gaussian
     distribution
  -- The lmer log-likelihood for the gaussian model seems accurate
  -- The lmer log-likelihoods for the other two models are completely
     bonkers (I *wish* I had a model that achieved a log likelihood of
     -5.8 over 10,000 samples...)

The lmer's reported AIC, BIC, and deviance appear to be similarly
underestimated.

Additionally, the random effects calculated by lmer for these models
seem very fishy, but I'm not as confident in my analysis here,
because I'm not entirely sure my toy data is correctly sampling from
the true distribution that lmer assumes.  Anyway, here's the code,
which is attempting to create a model like that above, but this time
with 10 subjects, each of whose intercept and slope terms are sampled
independently from normal distributions with different variances --
400 and 1, respectively:

---------------------------->8----------------------------

set.seed(0)
subjects <- 10
samples <- 1000
true.shape <- 5
i <- 200
i.var <- 400
b <- 3
b.var <- 1
x <- NULL
subj <- NULL
true.scale <- NULL
for (s in 1:subjects) {
  subj <- c(subj, rep(s, samples))
  x.subj <- rgamma(samples, 2, scale=2)
  x <- c(x, x.subj)
  i.subj <- rnorm(1, i, sqrt(i.var))
  b.subj <- rnorm(1, b, sqrt(b.var))
  true.scale <- c(true.scale, (i.subj + b*x.subj)/true.shape)
}
subj <- factor(subj)
y <- rgamma(subjects * samples, true.shape, scale=true.scale)
gaussian.lm <- lm(y ~ x)
gaussian.lmer <- lmer(y ~ x + (x | subj))
gamma.glm <- glm(y ~ x, family=Gamma(link="identity"))
gamma.lmer <- lmer(y ~ x + (x | subj), family=Gamma(link="identity"))
ig.glm <- glm(y ~ x, family=inverse.gaussian(link="identity"))
ig.lmer <- lmer(y ~ x + (x | subj),
family=inverse.gaussian(link="identity"))
nums.glm <- function(model) c(coef(model), logLik=logLik(model))
nums.lmer <- function(model) c(fixef(model), logLik=logLik(model))
rbind(gaussian.lm=nums.glm(gaussian.lm), gaussian.lmer=nums.lmer(gaussian.lmer),
      gamma.glm=nums.glm(gamma.glm), gamma.lmer=nums.lmer(gamma.lmer),
      ig.glm=nums.glm(ig.glm), ig.lmer=nums.lmer(ig.lmer))
VarCorr(gaussian.lmer)$subj
VarCorr(gamma.lmer)$subj
VarCorr(ig.lmer)$subj

---------------------------->8----------------------------

Output:

---------------------------->8----------------------------

              (Intercept)        x        logLik
gaussian.lm      192.1295 2.636897 -59432.332233
gaussian.lmer    192.0867 2.648113 -59257.574632
gamma.glm        192.3558 2.579673 -58668.208775
gamma.lmer       192.1256 2.578616  -1038.636243
ig.glm           192.4572 2.553089 -58910.126512
ig.lmer          192.4572 2.553093     -6.340702
> VarCorr(gaussian.lmer)$subj
2 x 2 Matrix of class "dpoMatrix"
            (Intercept)          x
(Intercept)   252.21320 11.4298584
x              11.42986  0.5179811
> VarCorr(gamma.lmer)$subj
2 x 2 Matrix of class "dpoMatrix"
            (Intercept)          x
(Intercept)   41.109713 1.63892630
x              1.638926 0.06795769
> VarCorr(ig.lmer)$subj
2 x 2 Matrix of class "dpoMatrix"
              (Intercept)             x
(Intercept)  1.538172e-06 -8.630771e-19
x           -8.630771e-19  5.124452e-13

---------------------------->8----------------------------

The first part just duplicates what I found above, to show that it
still arises with models that are not overspecified.

The second part is the variance-covariance matrices -- in no case are
the estimates anywhere near accurate, and for the inverse gaussian
model it somehow has decided that there are effectively no random
effects at all.  I see the same behavior on my real data -- that is,
random effects for the inverse gaussian model coming out to
effectively zero.  I can't say if the estimates for the gamma model
are also incorrect, because I don't know what the true values are
:-).  But this is not giving me confidence that the values I'm getting
are anywhere near true, either :-(.

This is all with R 2.6.0 on x86-64 linux, with lme4 0.99875-8.

Help!
-- Nathaniel

-- 
So let us espouse a less contested notion of truth and falsehood, even
if it is philosophically debatable (if we listen to philosophers, we
must debate everything, and there would be no end to the discussion).
  -- Serendipities, Umberto Eco