[R-sig-ME] lmer vs glmmPQL

Wed Jun 24 23:51:37 CEST 2009

Roger Levy wrote:
> Ben Bolker wrote:
>> Daniel Ezra Johnson wrote:
>>> Regarding this thread, what about the method of fitting nested models
>>> and using anova() to estimate a p-value.
>>>
>>> For example:
>>>
>>> mod2 = lmer(y ~ genotype + (1|block), family = binomial, data)
>>> mod0 = lmer(y ~ (1|block), family = binomial, data)
>>>
>>> anova(mod0,mod2)
>>>
>>> How does that p-value (which is one value for the whole term
>>> "genotype") relate to the individual coefficient p-values derived from
>>> the Z-scores inside summary(mod2)?
>>>
>>> Thanks,
>>> Dan
>>    The Z-scores are Wald tests.  The anova results are likelihood
>> ratio tests.  The latter are in general more reliable but are known
>> to be *un*reliable for small-sample (i.e.
>> small-number-of-random-effects-levels) LMMs (Pinheiro and Bates).
>> Wald tests are not *known* to be unreliable in the small-sample
>> case, but I believe that is a statement of ignorance rather than
>> a statement that they're OK ...
> 
> This is an interesting issue -- as Ben points out, likelihood-ratio 
> tests are known to be anti-conservative for tests between linear mixed 
> models differing only in fixed-effects structure.  I think the 
> best-known demonstration of this can be found in Pinheiro & Bates, 2000, 
> pages 87-92.  F-tests are less problematic.

  But note that F tests may not be available or well-defined for
GLMMs ...  (Distinguishing here between conditional F tests (e.g. p. 90
of Pinheiro & Bates 2000) and Wald F tests of parameter combinations --
as far as I know the former are not applicable to GLMMs.
Wald F tests are always available but don't necessarily (???) share
the same properties/reliability as the conditional F tests in the LMM
case ...)

> For logit mixed models, presumably the results will depend a bit on how 
> fitting is done because the likelihood has to be approximated.  For the 
> Laplace approximation currently in use in lme4, I've done some 
> simulations and generally found that likelihood-ratio tests are still 
> anti-conservative; tests based on Wald Z are not as anti-conservative, 
> though they don't seem quite nominal.  

  Hmmm.  Can't we separate estimation (for which Laplace and AGQ are
probably adequate) from inference?  Would there be any point in
considering inference on estimates from a less accurate (PQL)
approximation scheme as long as we have more accurate schemes available?

  I worry whether the Wald Z tests are consistently better or
just "differently wrong" ...

You can do simulations along
> these lines:
> 
> set.seed(2)
> invlogit <- function(x) {
>    temp <- exp(x)
>    return(temp/(1+temp))
> }
> L <- 10
> REP <- 4
> N <- 4*10
> fac <- gl(L,REP)
> f <- function(ignore) {
>    x <- runif(N)
>    r <- rnorm(L)
>    y <- rbinom(N,1,invlogit(r[fac]))
>    m0 <- lmer(y ~ (1 | fac), family="binomial")
>    m1 <- lmer(y ~ x + (1 | fac), family="binomial")
>    temp <- pnorm(fixef(m1)[2]/sqrt(vcov(m1)[2,2]))
>    p.z <- 2*min(temp, 1-temp)
>    p.lr <- 1-pchisq(as.numeric(2*(logLik(m1)-logLik(m0))),1)
>    return(c(p.z,p.lr))
> }
> K <- 5000
> p.vals <- sapply(1:K,f)
> apply(p.vals, 1, function(x) sum(x<0.05))/K
> # this gave me 0.0562 for Wald Z, 0.0608 for LRT
> 
> # Is Wald Z anti-conservative?  The below says "yes" with p=0.044
> 2*(1-pbinom(sum(p.vals[1,]<0.05), K, 0.05))
> 
> 
> Unlike what Ben states above, however, it's my intuition (and informal 
> observation on the basis of simulations) that what determines how 
> anticonservative the LRT is is not the number of random effect levels 
> but rather the ratio between the degrees of freedom involved in the LRT 
> to the residual degrees of freedom in the more general model.  The more 
> residual degrees of freedom (basically, the more observations), the less 
> anti-conservative the LRT is.  This can be assessed for a specific 
> dataset by fitting a null-hypothesis model, repeatedly generating new, 
> artificial responses from it, and looking at the distribution of LRTs on 
> the artificial responses (along the lines of P&B 2000).

    Your conclusion agrees with what Pinheiro and Bates say (p. 88:
"as the number of parameters being removed from the fixed effects
becomes large, compared to the total number of observations, this
inaccuracy in the reported p-values can be substantial").
I don't really know how to decide what is more relevant to
"asymptopia" -- if N is the total number of obs., n the number
of blocks (say m = number per block so N=mn), and p the number of
parameters (fixed? random? variances counting as 1, or n-1, or
something in between?), is it 1-p/N [as PB suggest], or (N-p) [suggested
above], or n? Demidenko p. 145 shows that for LMMs  N \to \infty is
not sufficient for consistency, that n \to \infty is also required ...
I don't know if that helps in practice.

  I don't know where my intuition comes from -- perhaps it's just
conservatism/pessimism, since it's often easier to get N>>p than to get
n large ...

  I do agree that the gold standard is simulations as suggested above.
There are examples in the "worked example" section of glmm.wikidot.com .
These simulations can be very slow -- hopefully (??) Doug Bates will be
able to work out an mcmcsamp() scheme for GLMMs soon, and we can see
if that (much faster) approach generally agrees with null simulations ...

@book{demidenko_mixed_2004,
	edition = {1},
	title = {Mixed Models: Theory and Applications},
	isbn = {0471601616},
	publisher = {{Wiley-Interscience}},
	author = {Eugene Demidenko},
	month = jul,
	year = {2004}
}

> 
> Roger
> 
> 

-- 
Ben Bolker
Associate professor, Biology Dep't, Univ. of Florida
bolker at ufl.edu / www.zoology.ufl.edu/bolker
GPG key: www.zoology.ufl.edu/bolker/benbolker-publickey.asc