# [R-sig-ME] lmer: ML and REML estimation

Douglas Bates bates at stat.wisc.edu
Thu Mar 26 22:36:21 CET 2009

On Wed, Mar 25, 2009 at 5:21 PM, Ben Bolker <bolker at ufl.edu> wrote:
> Douglas Bates wrote:
>
>> I would claim that maximum likelihood estimates are well-defined for
>> generalized linear mixed models but REML estimates are not. (It is
>> true that Mary Lindstrom and I did offer a definition of REML
>> estimates for nonlinear mixed-effects models but I consider that a
>> youthful indiscretion and I didn't inhale. :-)
>>
>> The bottom line is that REML only makes sense for linear mixed-effects models.
>
>  Thank you!  Good to see that clarified.  Looks like we got it
> wrong, or at least misleading, in our recent TREE paper -- oh well,
> science marches on.
>
>  Might anyone here be able to point to a citation (other than "D.
> Bates, r-sig-mixed-models mailing list, 25 March 2009) that would
> support this statement ... ?
>
>  For what it's worth (risking the wrath of the GODS), PROC NLMIXED
> doesn't try to do REML, but claims it's a computational issue rather
> than one of definition:
>
> "With PROC MIXED you can perform both maximum likelihood and restricted
> maximum likelihood (REML) estimation, whereas PROC NLMIXED only
> implements maximum likelihood. This is because the analog to the REML
> method in PROC NLMIXED would involve a high dimensional integral over
> all of the fixed-effects parameters, and this integral is typically not
> available in closed form."
> <http://www.sfu.ca/sasdoc/sashtml/stat/chap46/sect4.htm>

Interesting.  (Also interesting that the URL is to Simon Fraser
University in British Columbia and not SAS Institute but whatever
works.)

My concept of REML is more like what Peter Dalgaard described in a
later response on this thread.  I consider the REML criterion to be
related to the decomposition of the response space into the linear
subspace spanned by the columns of the fixed-effects model matrix and
the orthogonal to this subspace.   It happens this is equivalent to
using the integral of the likelihood over the fixed effects as the
estimation criterion but that relationship depends on the mean
response being a linear function of the fixed-effects parameters, and
on the probability model generating a Euclidean metric.  In
generalized linear models the metric is non-Euclidean and the mean is
related to the linear predictor through a nonlinear inverse link
function.

It is not obvious to me that integrating the likelihood over the
fixed-effects would produce a meaningful estimation criterion for
GLMMs.  I'm not saying that it wouldn't but I haven't thought through
the ramifications.  It wouldn't correspond to what I would recognize
as "residual" maximum likelihood because there is no cleanly defined
residual space.

> On the other hand, GLIMMIX lets you go ahead and hang yourself:
> GLIMMIX allows the use of restricted maximum likelihood (REML) methods,
> which have been shown to produce better estimates than full maximum
> likelihood (ML) when the number of higher-level units is small. REML is
> not available in NLMIXED."
> <www.nesug.org/proceedings/nesug06/an/da08.pdf>

> This is shortly after stating that
>                                                       "GLIMMIX, in
> contrast, can produce potentially biased estimates for both fixed
> effects and covariance parameters, especially for binary data

Bias is a funny beast and I wish we were not so inclined to divide
estimators into unbiased and biased.  First, bias is a binary
property.  An estimator is either biased or unbiased, which provides
you with exactly 1 bit of information (in the information-theoretic
sense) about the estimator.  Second, it is based on an expected value,
which is an integral.  Thus it depends upon the scale on which you are
evaluating the parameter.  In the "i.i.d. sample from a Gaussian
distribution" case S^2 is an unbiased estimate of \sigma^2 but S is
not an unbiased estimate of \sigma.

I have the feeling that the REML versus ML estimator (I try to avoid
the phrase "full maximum likelihood" - to me there is only one maximum
likelihood estimator for a given probability model and I regret that
phrases of the form "<adjective> maximum likelihood" have crept into
the nomenclature) may be important for the point estimator but may not
be as important for the distribution of the estimator but I don't have
any evidence to back that up.

>  Does anyone out there have a suggestion/defense for when it *is*
> acceptable to use PQL/MQL to fit binary GLMMs?

How did PQL get into this discussion?  I think of PQL as an
optimization algorithm that is intended to produce the maximum
likelihood estimates by an indirect optimization.  If you want to
define a criterion based on integrating the likelihood over the
fixed-effects that is one thing, but I don't think that has to lead
you to the PQL method.

>  A further question: do you think it will generally be true that ML
> estimates of random effects variances will be slightly biased downwards
> because we don't have an analogue of REML?  (A wild guess, but I would
> think that the mean of the posterior distribution of the variance
> estimate (with uninformative priors) would be unbiased since it averages
> across the variation in the estimate of the fixed effects????)

Is it true even for linear mixed models that maximum likelihood
estimates of variance components are biased downward? (And what would
happen if instead of estimating a variance component you estimated the
logarithm of a variance or the square root of a variance?)  I would
try a few examples and see what happens.  I have seen cases where the
ML estimates of a particular variance are actually greater than the
REML variance estimate for that component.  Most of the time there is
at least one variance component whose ML estimate is smaller than the
REML estimate but it doesn't have to be so for all the components.
Strangely, the REML and ML estimates of the residual variance are
often similar and that is one case where I know there is a smaller
denominator for REML than for ML.  It appears that the numerator,
which is the penalized residual sum of squares at the parameter
estimates, changes more-or-less proportionally to the denominator.