[R-sig-ME] lmer nonconvergent: care to run and explain?

Douglas Bates bates at stat.wisc.edu
Wed Oct 21 19:47:58 CEST 2015


Summary: We can't compare the performance of algorithms; we can only
compare implementations of algorithms.  Algorithms such as EM and GLS are
unnecessary because they do more work, in some cases much more work, than
the penalized least squares (PLS) approach.

tl;dr

It is best not to confuse the criterion being optimized with the particular
optimization method being used and with the particular implementation of
that method.  Even though we might like to compare optimization algorithms
and often write as if we had done so, we can only compare implementations.

The way that I think about the model and its implementation in the lme4
package is described in Bates, Maechler, Bolker and Walker (2015),
http://www.jstatsoft.org/article/view/v067i01.  See

> citation("lme4")

To cite lme4 in publications use:

  Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015).
  Fitting Linear Mixed-Effects Models Using lme4. Journal of
  Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.

A BibTeX entry for LaTeX users is

  @Article{,
    title = {Fitting Linear Mixed-Effects Models Using {lme4}},
    author = {Douglas Bates and Martin M{\"a}chler and Ben Bolker and Steve Walker},
    journal = {Journal of Statistical Software},
    year = {2015},
    volume = {67},
    number = {1},
    pages = {1--48},
    doi = {10.18637/jss.v067.i01},
  }

The probability model for linear mixed-effects models and the derivation of
expressions for the log-likelihood and the REML criterion are given in
sections 1.1 and 3.1 to 3.4.  Note that once the probability model for the
responses is defined there is only one definition of the likelihood.  There
can be different formulas for it but they should all be equivalent.
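
For concreteness, this is the model from Section 1.1 restated (using the
symbols that are spelled out in the next paragraph):

  Y | (B = b) ~ N(Xβ + Zb, σ²I),    B ~ N(0, Σ_θ)

so that, marginally,

  Y ~ N(Xβ, ZΣ_θZ' + σ²I).

The likelihood is this multivariate normal density evaluated at the
observed y and viewed as a function of (β, θ, σ); any algebraically
equivalent rearrangement of it gives the same value.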

Section 3.4 of that paper describes a way of evaluating the log-likelihood
and, more importantly, evaluating a profiled log-likelihood and a profiled
REML criterion.  The likelihood is a function of the parameters given the
observed data.  There are different ways of writing the parameters just as
there are different ways of writing the probability model.  We use the
fixed-effects parameter, β, the scale parameter, σ, of the per-observation
noise and a covariance parameter vector, θ, that determines the covariance
matrix, Σ, of the random effects.  The "profiled log-likelihood" is a
function of θ only.  It is the maximum of the log-likelihood over all
possible β and σ for that particular value of θ (equivalently, the minimum
of the deviance).  The point is that this can be evaluated quickly and
accurately by solving a penalized
least squares (PLS) problem for which the normal equations are very similar
to "Henderson's mixed-model equations".  Similar but with important
differences.  The idea of profiling the log-likelihood was present in Bates
and DebRoy (2004) but not in quite the same form and, again, the
differences are important.
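
In lme4 you can see this structure directly through the modular functions
described in the paper.  A minimal sketch using the built-in sleepstudy
data (the particular model and numbers are incidental, just an
illustration):

library(lme4)
lf <- lFormula(Reaction ~ Days + (Days | Subject), sleepstudy)
devfun <- do.call(mkLmerDevfun, lf)  # profiled deviance, a function of theta only
devfun(c(1, 0, 1))                   # each evaluation solves one PLS problem
opt <- optimizeLmer(devfun)          # optimize over theta alone

The fixed effects and σ are never optimized over directly; they are
recovered from the PLS solution at the optimal θ.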

In the MixedModels package for Julia I take this one step further and avoid
computing the solution to the PLS problem.  It turns out that the
information needed to evaluate the profiled log-likelihood is available at
an intermediate step in the solution.

Determining the maximum likelihood estimates is a matter of optimizing the
profiled log-likelihood with respect to θ only.  This is quite different
from the GLS approaches described in the hierarchical linear models or
multilevel models literature.  EM algorithms, etc. (including the ones that
co-authors and I have proposed) can be very slow to converge.  In fact it
is often difficult to tell whether such an algorithm has converged.
Essentially they alternate between optimizing β given θ and optimizing θ
given β.  The profiled log-likelihood approach builds the optimization of  β
given θ into the evaluation of the criterion to be optimized.
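
If the structural difference is hard to see, here is a toy R sketch that
has nothing to do with mixed models per se: the same two-argument function
minimized by alternating over its arguments versus by profiling one
argument out and optimizing the other directly.

f <- function(beta, theta) (beta - 2 * theta)^2 + (theta - 1)^2 + theta^4

## alternate: best beta given theta, then best theta given beta, repeat
beta <- 0; theta <- 0
for (i in 1:50) {
  beta  <- 2 * theta  # argmin over beta for fixed theta
  theta <- optimize(function(t) f(beta, t), c(-5, 5))$minimum
}

## profile: substitute the closed-form beta(theta) = 2 * theta, then
## optimize over theta alone
fprof <- function(theta) f(2 * theta, theta)
opt <- optimize(fprof, c(-5, 5))

Both reach the same answer, but the alternating version creeps toward it
over many passes while the profiled version is a single one-dimensional
optimization.  In lmer the role of the closed-form beta(theta) is played by
the PLS solution.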

Solving the PLS problem is a direct (i.e. non-iterative) calculation.  For
models with a large number of random effects the PLS solution requires
solving a large but very sparse positive definite system of equations.  In
lmer we use the sparse Cholesky decomposition, as implemented in the
CHOLMOD library of C functions by Tim Davis, to solve this.  In the
MixedModels package for Julia I use blocked matrix methods rather than
general sparse matrix methods.
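
The CHOLMOD code that lmer uses is essentially the same code the Matrix
package in R wraps, so you can experiment with this kind of computation
directly.  A made-up example (the matrices are arbitrary, not anything
lmer actually forms):

library(Matrix)
set.seed(1)
Z <- rsparsematrix(1000, 50, density = 0.02)  # an arbitrary sparse matrix
A <- crossprod(Z) + Diagonal(50)              # sparse, symmetric, positive definite
L <- Cholesky(A, LDL = FALSE)                 # sparse Cholesky factorization (CHOLMOD)
x <- solve(L, rnorm(50))                      # solve A x = b using the factorization

The log-determinant that appears in the profiled deviance is also cheap to
obtain from the factor, which is part of what makes this route attractive.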

To recap, in the lme4 package for R and in the MixedModels package for
Julia, determining the maximum likelihood estimates is a matter of
optimizing the profiled log-likelihood with respect to θ.  Usually the
dimension of θ is quite small.  Colin Longhurst and I have compared various
optimizers on several examples of linear mixed models from the literature
(including paulsim.json).  The dimension of θ (the "nopt" value listed
below) for each of the fitted models is:

Alfalfa.json:      "nopt": 1,
Animal.json:      "nopt": 2,
Assay.json:      "nopt": 2,
AvgDailyGain.json:      "nopt": 1,
AvgDailyGain.json:      "nopt": 1,
BIB.json:      "nopt": 1,
Bond.json:      "nopt": 1,
BS10.json:      "nopt": 20,
BS10.json:      "nopt": 8,
cake.json:      "nopt": 1,
Chem97.json:      "nopt": 2,
Chem97.json:      "nopt": 2,
Cultivation.json:      "nopt": 1,
d3.json:      "nopt": 9,
Demand.json:      "nopt": 2,
dialectNL.json:      "nopt": 6,
Dyestuff2.json:      "nopt": 1,
Dyestuff.json:      "nopt": 1,
egsingle.json:      "nopt": 2,
ergoStool.json:      "nopt": 1,
Exam.json:      "nopt": 1,
Exam.json:      "nopt": 1,
Gasoline.json:      "nopt": 1,
GB12.json:      "nopt": 20,
GB12.json:      "nopt": 8,
Genetics.json:      "nopt": 2,
HR.json:      "nopt": 3,
Hsb82.json:      "nopt": 1,
IncBlk.json:      "nopt": 1,
InstEval.json:      "nopt": 2,
InstEval.json:      "nopt": 3,
KB07.json:      "nopt": 72,
KB07.json:      "nopt": 16,
Mississippi.json:      "nopt": 1,
mm0.json:      "nopt": 6,
Oxboys.json:      "nopt": 3,
Pastes.json:      "nopt": 2,
paulsim.json:      "nopt": 2,
paulsim.json:      "nopt": 1,
PBIB.json:      "nopt": 1,
Penicillin.json:      "nopt": 2,
Poems.json:      "nopt": 3,
ScotsSec.json:      "nopt": 2,
Semi2.json:      "nopt": 2,
Semiconductor.json:      "nopt": 1,
SIMS.json:      "nopt": 3,
sleepstudy.json:      "nopt": 3,
sleepstudy.json:      "nopt": 2,
TeachingII.json:      "nopt": 1,
Weights.json:      "nopt": 3,
WWheat.json:      "nopt": 3,

For most of these examples the dimension of θ is 1, 2 or 3, which makes it
a very simple optimization problem.  The cases with a comparatively
high-dimensional θ (say, 8 or more) are all examples of overparameterized
"maximal" models, in the sense of Barr et al. (2013), where convergence is
inevitably on the boundary of the parameter space, representing a singular
model.  If we ignore these meaningless models the problem of fitting linear
mixed-effects models can be reduced to a constrained optimization of a
function of a low-dimensional parameter vector, θ.
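
For any fitted lmer model you can inspect both the dimension of θ and the
constraints directly; for example, with the sleepstudy model (presumably
the nopt = 3 entry in the list above):

library(lme4)
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
getME(fm, "theta")  # the three covariance parameters (2x2 lower triangle)
getME(fm, "lower")  # box constraints: 0 for "diagonal" elements, -Inf otherwise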

This is the point where you can only compare implementations, not
algorithms.  If you are going to compare, say, SAS vs Stata vs HLM5 vs
MLwiN vs lmer vs lmm you must be careful to determine how the starting
values are set (I really wish I knew how SAS did this because they do it
very well) and, most importantly, what the convergence criteria are.   We
tend to be conservative in setting the convergence criteria, which means
that the algorithm gets into the neighborhood of the optimum quite quickly
but then requires many iterations to satisfy the convergence criteria.  The
easy way to cut down on the number of iterations is to relax the
convergence criteria.  For example, this is the trace of the iterations
when fitting a model to your simulated data

julia> fit!(lmm(Y ~ 1+S+T+U + (1|G) + (0+S|G), paulsim),true)
f_1: 144180.72425, [1.0,1.0]
f_2: 144274.57674, [1.75,1.0]
f_3: 144292.63941, [1.0,1.75]
f_4: 144067.78454, [0.25,1.0]
f_5: 143903.64334, [1.0,0.25]
f_6: 144446.67827, [0.292893,0.0]
f_7: 143529.46708, [1.30302,0.0290838]
f_8: 143877.77451, [2.05246,0.0]
f_9: 143795.73713, [1.19772,0.126175]
f_10: 143823.34211, [1.35678,0.0]
f_11: 143762.05604, [1.33438,0.097211]
f_12: 143691.53831, [1.24165,0.0721959]
f_13: 143814.84253, [1.23389,0.0]
f_14: 143644.50723, [1.3598,0.0523283]
f_15: 143587.62368, [1.29002,0.040578]
f_16: 143530.10373, [1.3131,0.0289719]
f_17: 143503.41482, [1.30555,0.0248514]
f_18: 143451.14914, [1.30452,0.0174227]
f_19: 143543.50332, [1.29842,0.00371975]
f_20: 143432.40261, [1.3115,0.0146834]
f_21: 143428.33727, [1.31289,0.0140573]
f_22: 143421.02525, [1.31567,0.01284]
f_23: 143416.50093, [1.31817,0.0119544]
f_24: 143415.43812, [1.31872,0.0117211]
f_25: 143415.77708, [1.31947,0.011772]
f_26: 143412.73718, [1.31834,0.0111002]
f_27: 143409.47407, [1.31848,0.00964196]
f_28: 143412.89517, [1.31829,0.00818943]
f_29: 143409.37014, [1.31776,0.00942966]
f_30: 143409.25932, [1.31702,0.00951907]
f_31: 143409.24635, [1.31692,0.00951671]
f_32: 143409.22105, [1.31672,0.00950082]
f_33: 143409.17595, [1.31633,0.00945426]
f_34: 143409.10778, [1.31554,0.00936069]
f_35: 143408.99981, [1.31395,0.00923051]
f_36: 143409.11111, [1.31185,0.00897081]
f_37: 143408.9328, [1.31439,0.00941748]
f_38: 143408.71263, [1.3128,0.00946254]
f_39: 143408.29263, [1.30962,0.00948033]
f_40: 143407.46398, [1.30325,0.00944035]
f_41: 143406.05675, [1.29052,0.00911243]
f_42: 143403.31125, [1.26506,0.00879603]
f_43: 143401.54026, [1.21415,0.00780416]
f_44: 143396.14586, [1.20279,0.00841878]
f_45: 143391.04011, [1.17739,0.0102159]
f_46: 143387.1939, [1.15196,0.00897777]
f_47: 143383.86535, [1.1265,0.00882717]
f_48: 143378.20151, [1.07558,0.00833507]
f_49: 143368.11174, [1.02468,0.00991746]
f_50: 143351.94197, [0.922835,0.0097862]
f_51: 143317.34013, [0.719146,0.0095397]
f_52: 143249.65238, [0.311767,0.0096406]
f_53: 143234.90081, [0.0,0.0108462]
f_54: 153266.3997, [0.0,0.0]
f_55: 143236.18137, [0.0108416,0.0111644]
f_56: 143312.91379, [0.0,0.0216925]
f_57: 143234.89379, [0.00542052,0.0108433]
f_58: 143260.48202, [0.00181902,0.0148979]
f_59: 143232.64299, [0.00487847,0.00975901]
f_60: 143234.66662, [0.00741215,0.00879308]
f_61: 143233.37725, [0.00509825,0.0103309]
f_62: 143234.28069, [0.00386846,0.0106634]
f_63: 143233.15163, [0.00514999,0.00922613]
f_64: 143232.79521, [0.00547335,0.00943397]
f_65: 143232.65672, [0.00475013,0.00980271]
f_66: 143232.63871, [0.00521639,0.00973263]
f_67: 143232.64022, [0.00555523,0.00974079]
f_68: 143232.63846, [0.00519816,0.00973069]
f_69: 143232.63849, [0.00516152,0.00973133]
f_70: 143232.64119, [0.00519516,0.00974877]
f_71: 143232.63707, [0.00519664,0.00971243]
f_72: 143232.63685, [0.00516081,0.00970472]
f_73: 143232.63677, [0.00508753,0.00970302]
f_74: 143232.6367, [0.00501424,0.00970158]
f_75: 143232.63662, [0.00494096,0.00970364]
f_76: 143232.63655, [0.00486769,0.00970559]
f_77: 143232.63649, [0.00479439,0.00970666]
f_78: 143232.63635, [0.00464778,0.00970688]
f_79: 143232.63626, [0.00450119,0.00970896]
f_80: 143232.6361, [0.00435459,0.00970816]
f_81: 143232.63591, [0.00406139,0.00971066]
f_82: 143232.63566, [0.00376818,0.0097104]
f_83: 143232.63537, [0.00318176,0.00971386]
f_84: 143232.63503, [0.00259534,0.00971415]
f_85: 143232.63463, [0.00142249,0.00971612]
f_86: 143232.63533, [0.000249701,0.00972778]
f_87: 143232.92661, [0.00142012,0.00934017]
f_88: 143232.63418, [0.000836292,0.00969989]
f_89: 143232.63431, [0.00142272,0.00970059]
f_90: 143232.80051, [0.000849809,0.00999279]
f_91: 143232.63446, [0.000952884,0.00971503]
f_92: 143232.64471, [0.000849105,0.00963252]
f_93: 143232.63416, [0.000888831,0.0097039]
f_94: 143232.63415, [0.000830049,0.00970325]
f_95: 143232.63418, [0.000771406,0.00970735]
f_96: 143232.63589, [0.000832025,0.0097322]
f_97: 143232.63415, [0.000835716,0.00970347]
f_98: 143232.63415, [0.00082632,0.00970365]
f_99: 143232.63415, [0.000824052,0.00970369]
f_100: 143232.63415, [0.000819517,0.00970378]
f_101: 143232.63415, [0.000816065,0.00970384]
f_102: 143232.63415, [0.000814649,0.00970387]
f_103: 143232.63415, [0.000811198,0.00970398]
f_104: 143232.63415, [0.000808462,0.009704]
f_105: 143232.63415, [0.000806745,0.00970404]
f_106: 143232.63415, [0.00080364,0.00970411]
f_107: 143232.63415, [0.000802646,0.00970413]
FTOL_REACHED

Officially the count of function evaluations is 107, but everything from
evaluation 71 onward consists of minor adjustments to the parameter values,
affecting only the 9th and later significant digits of the criterion.
Also, the nature of the simulation is such that these parameter values are
so small that the differences are negligible.

The convergence warnings from lmer are often spurious and, because there
are so many false positives, they do more harm than good.  My approach in
the MixedModels package is to use a reliable, fast optimizer (LN_BOBYQA
from Steven Johnson's NLopt package) and set relatively stringent
tolerances for convergence.  This doesn't guarantee convergence to the
global optimum but there really is no good way of doing that.  Most of the
time the optimum is fairly well defined, in which case a good optimization
algorithm gets you there quickly.
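
In lmer these knobs are exposed through lmerControl().  For instance (the
particular values here are arbitrary, not recommendations):

library(lme4)
## use minqa's BOBYQA and loosen its final trust-region radius
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
            control = lmerControl(optimizer = "bobyqa",
                                  optCtrl = list(rhoend = 1e-6)))
## or use the nloptwrap BOBYQA and loosen its function-value tolerance
fm2 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
            control = lmerControl(optimizer = "nloptwrap",
                                  optCtrl = list(ftol_abs = 1e-6)))

Loosening the tolerances trims the long tail of near-identical evaluations
like the one in the trace above, at the cost of a slightly less precise
optimum.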

Regarding how to teach this: if your students are not frightened of vectors
and matrices, I would start with Henderson's mixed model equations because
they are well-established in the literature.  The profiled log-likelihood
can be expressed using the solution to those equations (although Henderson
didn't realize this) because all you need is the minimum penalized residual
sum-of-squares and the determinant of a matrix that looks rather benign.
The purpose of all of the contortions in lmer and lmm is to evaluate that
determinant efficiently as a by-product of solving the mixed-model
equations.
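
Concretely, that statement can be written in one line.  With r²(θ) the
minimum penalized residual sum of squares and L_θ the sparse Cholesky
factor from the mixed-model equations, the profiled deviance d(θ), which is
-2 times the profiled log-likelihood, is, essentially as derived in
Section 3.4 of the paper,

  d(θ) = log |L_θ|² + n [1 + log(2π r²(θ) / n)]

and |L_θ|² is exactly the benign-looking determinant referred to above,
available from the diagonal of L_θ at essentially no extra cost.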


On Wed, Oct 21, 2015 at 10:53 AM Paul Johnson <pauljohn32 at gmail.com> wrote:

> Thanks to everybody for looking at the example. If that code can be
> re-used in any way that helps lme4 development, I give permission to
> re-use or edit it and put it to use. I'm happy to let everybody who
> actually understands this debate it.  I don't (yet).
>
> I need to explain to users why we have these warnings with lmer but
> not SAS or Stata.  In the output I pasted in to the original email, it
> reports convergence in a few steps of EM.  But lmer is going for a lot
> more iterations.  How to explain that difference to students?
>
> I'm reading through the papers that Doug has written in the last 10
> years or so explaining the estimation process in PLS.   Bates and
> DebRoy makes this clear for LMM.  In comparison, the mainstream HLM
> folks treated MLM as a GLS problem. Raudenbush & Bryk, for example, or
> Snijders & Bosker, describe calculation of predictions for the b's as
> a posterior calculation, rather than an element of the optimization.
>
> It appears to me Stata is written that GLS way.  Stata has a parameter
> vector with fixed effects and variances of random effects (Beta,
> Sigma).  In contrast, lmer is optimising over (Beta, Sigma, b).
>
> Am I just making up a story here?
>
> pj
>
> On Fri, Oct 16, 2015 at 1:35 PM, Douglas Bates <bates at stat.wisc.edu>
> wrote:
> > For those who may be interested, these are the results of timing the
> > fits of two models on these simulated data.  For consistency within the
> > timings I have renamed the grouping factor Mind to G and named the three
> > continuous covariates as S, T and U.  The optimizers whose names start
> > with LN_ are timings from the Julia MixedModels package using the NLopt
> > package for optimization.  Those whose names start with NLOPT_LN_ are the
> > same optimizer code accessed through the nloptr package for R.  The
> > others are from the optimx package, bobyqa from the minqa package (the
> > default for lmer) and the built-in Nelder_Mead optimizer, which is
> > generally pretty bad and I can say that because I wrote it.
> >
> > dsname = "paulsim"
> > form = Formula: Y ~ 1 + S + T + U + (1 | G) + ((0 + S) | G)
> > -2log(likelihood) time(s) feval geval optimizer
> >    143232.6341     1.5120   606     0 bobyqa
> >    143564.1597     0.2770    70     0 Nelder_Mead
> >    143232.9465     0.2680    66     0 NLOPT_LN_BOBYQA
> >    143272.7444     0.2430    53     0 NLOPT_LN_COBYLA
> >    143803.9823     0.3420    40     0 NLOPT_LN_NELDERMEAD
> >    143232.6341     0.4570   147     0 NLOPT_LN_SBPLX
> >    143232.6582     0.6320    58     0 optimx:L-BFGS-B
> >    143232.6341     0.5480   104     0 optimx:nlminb
> >    143232.6341     6.7930    NA     0 optimx:spg
> >    143232.6341     1.6930    NA     0 optimx:bobyqa
> >    143232.6341     0.0489   107     0 LN_BOBYQA
> >    143232.6382     1.9885 69711     0 LN_COBYLA
> >    143803.9823     0.0474    56     0 LN_NELDERMEAD
> >    143232.6341     0.0527   147     0 LN_SBPLX
> > form = Formula: Y ~ 1 + S + T + U + ((0 + S) | G)
> > -2log(likelihood) time(s) feval geval optimizer
> >    143232.6341     0.1400    41     0 bobyqa
> >    143232.6341     0.1510    49     0 Nelder_Mead
> >    143232.6343     0.1360    36     0 NLOPT_LN_BOBYQA
> >    143232.6503     0.1170    24     0 NLOPT_LN_COBYLA
> >    143232.6341     0.1540    48     0 NLOPT_LN_NELDERMEAD
> >    143232.6341     0.1900    74     0 NLOPT_LN_SBPLX
> >    143232.6368     0.3560    70     0 optimx:L-BFGS-B
> >    143232.6341     0.2470    29     0 optimx:nlminb
> >    143232.6341     0.3660    NA     0 optimx:spg
> >    143232.6341     0.2650    NA     0 optimx:bobyqa
> >    143232.6341     0.0240    43     0 LN_BOBYQA
> >    143232.6341     0.0240    34     0 LN_COBYLA
> >    143232.6341     0.0242    52     0 LN_NELDERMEAD
> >    143232.6341     0.0246    81     0 LN_SBPLX
> >
>
>
>
> --
> Paul E. Johnson
> Professor, Political Science        Director
> 1541 Lilac Lane, Room 504      Center for Research Methods
> University of Kansas                 University of Kansas
> http://pj.freefaculty.org              http://crmda.ku.edu
>



