[R-sig-ME] optimizers for mixed models
Ben Bolker
bbolker at gmail.com
Thu Mar 14 18:08:18 CET 2013
Ross Boylan <ross at ...> writes:
>
> lme4 appears to use Nelder Mead for the final stage of its
> optimization. The archives show various concerns about stability, but
> it looks as if even the prior optimizer was in the simplex class.
Depends how far you go back. nlme and older versions of lme4 used
nlminb. This is derivative-based, but you'd have to look at the guts
of nlminb and at the PORT documentation for a detailed description ...
http://netlib.bell-labs.com/cm/cs/cstr/153.pdf
I think it's quasi-Newton ... digging around more in the R code
guts shows it's NL2SOL -- you could dig some more on Netlib to
find the details
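For reference, nlminb is still available in base R. A minimal sketch of its interface, on a toy quadratic rather than a mixed-model deviance:

```r
# nlminb: the PORT-based, derivative-aware optimizer that nlme
# and older lme4 versions used internally.
# Toy objective: a shifted quadratic with its minimum at c(1, 2).
obj <- function(par) (par[1] - 1)^2 + (par[2] - 2)^2

fit <- nlminb(start = c(0, 0), objective = obj)
fit$par          # should be very close to c(1, 2)
fit$convergence  # 0 indicates successful convergence
```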
> This surprised me, since I have found Nelder Mead to be relatively slow
> and imprecise, and in a more complex mixed model with sampling I've been
> using optim with L-BFGS-B, which is quasi-newton. One of the parameter
> estimates kept creeping up over iterations; the bounds were necessary to
> cut it off.
Not quite sure what you mean here. I have found L-BFGS-B
to be somewhat unstable in my experience ... The optimplus package
on R-forge gives a lot more optimizer choices ...
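To make the speed comparison concrete, here is a quick side-by-side of Nelder-Mead and L-BFGS-B through base R's optim(), on the Rosenbrock function (a standard difficult test case, not a mixed-model deviance):

```r
# Rosenbrock "banana" function, minimized at c(1, 1)
rosen <- function(p) 100 * (p[2] - p[1]^2)^2 + (1 - p[1])^2

nm <- optim(c(-1.2, 1), rosen, method = "Nelder-Mead")
lb <- optim(c(-1.2, 1), rosen, method = "L-BFGS-B",
            lower = c(-5, -5), upper = c(5, 5))

# Both should land near c(1, 1); $counts shows how many
# function/gradient evaluations each method needed.
nm$par; nm$counts
lb$par; lb$counts
```

(L-BFGS-B is estimating the gradient by finite differences here, since no gradient function is supplied; on noisy or non-smooth objectives that is exactly where it can get into trouble.)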
> If anyone can give me more insight into why Nelder Mead is in use, and
> perhaps whether I should be using it myself, I'd appreciate it.
Nelder-Mead *is* fairly slow, but it tends to be slightly more robust
over a broad class of problems. The current development version of lme4
lets you pick your own optimizer, so you can make the speed vs
robustness tradeoff however you like ...
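In released versions of lme4 this ended up as the optimizer argument of lmerControl(); the exact interface in the development version at the time may have differed slightly. A sketch using the built-in sleepstudy data:

```r
library(lme4)

# Fit the same model with two different optimizers:
# the default Nelder-Mead implementation vs. "bobyqa"
# (derivative-free trust-region method from the minqa package).
fm_nm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
              control = lmerControl(optimizer = "Nelder_Mead"))
fm_bq <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
              control = lmerControl(optimizer = "bobyqa"))

# Fixed-effect estimates should agree closely between optimizers
fixef(fm_nm)
fixef(fm_bq)
```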
> The current behavior, in which about 10% of the simulated datasets have
> convergence problems, with one of the parameters heading toward an
> implausible value (a correlation of 1.0, where atanh(rho) is the actual
> parameter being estimated) is certainly not ideal. I tried simulated
> annealing to see if the algorithm had wandered mistakenly toward a local
> rather than global optimum; there was no evidence that it had (the SA
> ended up in the same neighborhood as BFGS, and when that point was the
> starting value for BFGS it ended on essentially the same point as before).
Can you send me a simulated example that experiences this problem?
I can't tell at the moment whether you are using lme4 or your
own hand-rolled solution to the problem.
There are two possibilities here: (1) the best-fit solution really
is at the boundary, i.e. in the limit as atanh(rho) -> Inf
(correlation -> 1.0); this is similar to the case where the MLE of
a random-effect variance is zero even when the true model (from
which the data were simulated) has a non-zero variance; (2) the
best-fit solution is in the interior but very close to +/- 1.0.
It's going to be tough either way, but in case #1 you would
definitely be better off with a parameter space in which the
singular cases corresponded to _simple_ box constraints on the
parameters. For the Cholesky parameterization this is true
for the variances (for a 2-by-2 matrix, theta_1=0 and theta_2=theta_3=0
are the conditions for the first and second variances, respectively,
to be 0), and (mostly) for the correlations -- the correlation goes
to +/- 1 when theta_3=0 *or* when theta_2 -> infinity, but in the
latter case the second variance will blow up to infinity as well
(so probably not a case worth worrying about).
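The 2-by-2 case above can be checked numerically. A sketch, taking L as the lower-triangular Cholesky factor with entries theta_1, theta_2, theta_3 and Sigma = L L' (so the variances are theta_1^2 and theta_2^2 + theta_3^2, and the correlation is theta_2 / sqrt(theta_2^2 + theta_3^2)):

```r
# Build the 2x2 covariance matrix from Cholesky parameters
# L = | theta1   0      |
#     | theta2   theta3 |
cov2 <- function(theta) {
  L <- matrix(c(theta[1], theta[2], 0, theta[3]), 2, 2)
  L %*% t(L)
}
corr <- function(theta) cov2cor(cov2(theta))[1, 2]

corr(c(1, 0.5, 0.5))  # interior point: correlation strictly inside (-1, 1)
corr(c(1, 0.5, 0))    # theta3 = 0: correlation exactly 1
corr(c(1, 1e6, 0.5))  # theta2 large: correlation -> 1, second variance blows up
```

So the singular-correlation case theta_3 = 0 is a simple box constraint, while the theta_2 -> infinity route to correlation 1 also drives the second variance to infinity.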