[R-sig-ME] understanding log-likelihood/model fit

Thu Aug 21 02:51:22 CEST 2008

I find that a very insightful way to think about results from
Daniel's simulated data.  In the case of his null1 model,
failure to adjust for the fixed effect led to a large (huge
relative to the component in fixed1) between subjects
component of variance.  There was, then, a small penalty
for random subject effects whose distribution looked
nothing like normal, and the residuals looked remarkably
normal.  It is an extreme example of a case where the
residuals have scant diagnostic usefulness.

My concern with some of the Pinheiro and Bates plots is
that they look too soon at the residuals, before doing the
grosser checks that are needed that the model is half-way
correct, and that the points are not joined up a/c subject
or whatever, so that within subject systematic departures
from the model are not evident.  I seem to recall finding
an aberrant patient when I looked at the Dialyzer data
some considerable time ago, and/or an aberrant Orange
tree.  In a quick look, I was not immediately able to
recover the state of thinking that led me to this conclusion.
Irrespective of the application to these data, the point that
residuals are one part only of the relevant diagnostic
information, and may indeed not convey much at all of
the relevant information, stands.

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 21 Aug 2008, at 5:53 AM, Douglas Bates wrote:

> On Wed, Aug 20, 2008 at 2:24 PM, Andrew Robinson
> <A.Robinson at ms.unimelb.edu.au> wrote:
>> I'll take a swing at these.
>>
>> On Wed, Aug 20, 2008 at 02:01:29PM +0100, Daniel Ezra Johnson wrote:
>>> Everyone agrees about what happens here:
>>>
>>> Nsubj <- 10
>>> Ngrp <- 2
>>> NsubjRep <- 5
>>> set.seed(123)
>>> test1s <- data.frame(subject = rep(seq(Nsubj * Ngrp), each =  
>>> NsubjRep),
>>>       response=500+c(rep(-100,Nsubj * NsubjRep),rep(100,Nsubj *
>>> NsubjRep))+rnorm(Nsubj * Ngrp * NsubjRep, 0, 10),
>>>       fixed=(rep(c("A","B"),each=Nsubj * NsubjRep)))
>>> null1 <- lmer(response~(1|subject),test1s)
>>> fixed1 <- lmer(response~fixed+(1|subject),test1s)
>>>
>>> I still have two questions which I'll try to restate. I should note
>>> that I have attempted to understand the mathematical details of ML
>>> mixed effect model fitting and it's a bit beyond me. But I hope that
>>> someone can provide an answer I can understand.
>>>
>>> Question 1: When you have an "outer" fixed effect and a "subject"
>>> random effect in the same model, specifically why does the model
>>> (apparently) converge in such a way that the fixed effect is  
>>> maximized
>>> and the random effect minimized? (Not so much why should it, as why
>>> does it? This is the 'fixed1' case.)
>
> I approach this from a slightly different point of view than does
> Andrew.  I prefer to think of the process of estimating the fixed
> effects parameters and the random effects as a penalized estimation
> problem where the penalty is applied to the random effects only.  The
> magnitude of the penalty depends on the (unconditional) variance
> covariance matrix of the random effects.  When the variances are small
> there is a large penalty.  When the variances are large there is a
> small penalty on the size of the random effects.  The measure of model
> complexity, which is related to the determinant of the conditional
> variance of the random effects, given the data, has the opposite
> behavior.  When the variance of the random effects is small the model
> is considered simpler.  The simplest possible model on this scale is
> one without any random effects at all, corresponding to a variance of
> zero.  The larger the variance of the random effects the more complex
> the model.
>
> The maximum likelihood criterion seeks a balance between model
> complexity and fidelity to the data.  Simple models fit the data less
> well than do complex models.
>
> The point to notice, however, is that the penalty on model complexity
> applies to the random effects only, not to the fixed effects.  From
> this point of view if the model can explain a certain pattern in the
> data either with fixed-effects parameters or with random effects there
> is an advantage in explaining it with the fixed effects.
>
>>
>> A common way for hierarchical models to be fit using ML is by
>> profiling out the fixed effects, estimating the random effects, and
>> then using GLS to estimate the fixed effects conditional on the  
>> random
>> effects.  So, any explanatory capacity that the fixed effects offer  
>> is
>> deployed before the random effects are invoked.
>>
>> Likewise a popular way to applying ReML is to fit the fixed effects
>> using OLS, then estimate the random effects from the residuals.
>> Again, the net effect is that any explanatory capacity that the fixed
>> effects offer is deployed before the random effects are invoked.
>>
>>> Question 2: Take the fixed1 model from Question 1 and compare it to
>>> the null1 model, which has a random subject effect but no fixed
>>> effect. The predicted values of the two models -- the ones from
>>> fitted(), which include the ranefs -- are virtually the same. So why
>>> does fixed1 have a lower deviance, why is it preferred to null1 in a
>>> likelihood ratio test? (Again, I'm not asking why it's a better  
>>> model.
>>> I'm asking questions about the software, estimation procedure, and/ 
>>> or
>>> theory of likelihood applied to such cases.)
>
> Because the likelihood involves a penalty on the size of the variance
> of the random effects.  Obtaining a similar fit (fidelity to the data)
> with a much lower variance on the random effects (the penalty on model
> complexity) is advantageous under the likelihood criterion.
>
>
>
>>
>> The deviance is computed from the log likelihood of the data,
>> conditional on the model.  The LL of the null model is maximized by
>> making the variance components big enough to cover the variation of
>> the data.  But, this means that the likelihood is being spread  
>> thinly,
>> as it were.  Eg ...
>>
>>> dnorm(0, sd=1)
>> [1] 0.3989423
>>> dnorm(0, sd=2)
>> [1] 0.1994711
>>
>> On the other hand, fixed1 uses fixed effects to explain a lot of that
>> variation, so that when the time comes to estimate the random  
>> effects,
>> they are smaller, and the LL is higher, because it doesn't have to
>> stretch so far.
>>
>> Cheers,
>>
>> Andrew
>>
>> --
>> Andrew Robinson
>> Department of Mathematics and Statistics            Tel:  
>> +61-3-8344-6410
>> University of Melbourne, VIC 3010 Australia         Fax:  
>> +61-3-8344-4599
>> http://www.ms.unimelb.edu.au/~andrewpr
>> http://blogs.mbs.edu/fishing-in-the-bay/
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>