[R-sig-ME] multicore lmer: any changes since 2007?

Douglas Bates bates at stat.wisc.edu
Wed Jul 6 16:59:07 CEST 2011


On Tue, Jul 5, 2011 at 4:52 PM, Mike Lawrence <Mike.Lawrence at dal.ca> wrote:
> Back in 2007 (http://tolstoy.newcastle.edu.au/R/e2/help/07/05/17777.html)
> Dr. Bates suggested that using a multithreaded BLAS was the only
> option for speeding lmer computations on multicore machines (and even
> then, it might cause a slowdown under some circumstances).
>
> Is this advice still current, or have other means of speeding lmer
> computations on multicore machines arisen in more recent years?

As always, the problem in trying to parallelize a particular
calculation is determining how and when to start more than one
thread.

After the setup stage, the calculations in fitting an lmer model
involve optimizing the profiled deviance or the profiled REML
criterion.  Each evaluation of the criterion involves updating the
components of the relative covariance factor, updating the sparse
Cholesky decomposition, and solving a couple of systems of equations
involving the sparse Cholesky factor.
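
To make that concrete, here is a rough sketch of the update-then-solve
pattern at the R level using the Matrix package.  The matrix and the
rescaling are made up purely for illustration; lme4 itself does this
through CHOLMOD in compiled code.

library(Matrix)
set.seed(101)
Zt <- rsparsematrix(50, 500, density = 0.05)  # stand-in for a random-effects model matrix
## symbolic analysis plus first numeric factorization of Zt Zt' + I
L <- Cholesky(tcrossprod(Zt), Imult = 1)
## one "evaluation": rescale (new covariance parameters), update, solve
Zt2 <- 0.5 * Zt                               # made-up rescaling
L <- update(L, tcrossprod(Zt2), mult = 1)     # numeric update, same sparsity pattern
u <- solve(L, matrix(rnorm(50), 50, 1))       # solve with the updated factor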

There are a couple of calculations involving dense matrices, but in
most cases the time spent on them is negligible relative to the time
spent on the sparse matrices.

A multithreaded BLAS will only help speed up the calculations on dense
matrices.  The "supernodal" form of the Cholesky factorization can use
the BLAS for some calculations, but usually only on small blocks.
Most of the time the software chooses the "simplicial" form of the
factorization, because the supernodal form would not be efficient, and
the simplicial form doesn't use the BLAS at all.  Even if the
supernodal form is chosen, the block sizes are usually small, and a
multithreaded BLAS can actually slow down operations on small blocks
because the communication and synchronization overhead outweighs any
gain from using multiple cores.
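
If you want to see which form CHOLMOD picks for a given sparsity
pattern, the Matrix package lets you ask.  A small, made-up example:

library(Matrix)
set.seed(42)
M <- rsparsematrix(2000, 2000, density = 0.001)
A <- crossprod(M)                    # symmetric, positive semi-definite
ch <- Cholesky(A, Imult = 1, super = NA)  # factor A + I; super = NA lets CHOLMOD choose
class(ch)                            # "dCHMsimpl" or "dCHMsuper"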

Of course, your mileage may vary, and only by profiling both the R
code and the compiled code will you be able to determine how things
could be sped up.
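
For the R-level part, Rprof is enough to see where the time goes
(the compiled code needs an external profiler).  Something like the
following, with an arbitrary output file name:

library(lme4)
Rprof("lmer-profile.out")            # start the R sampling profiler
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
Rprof(NULL)                          # stop profiling
summaryRprof("lmer-profile.out")$by.self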

If I had to guess, I would say that the best hope for parallelizing
the computation would be to find an optimizer that allows for parallel
evaluation of the objective function.  The lme4 package requires
optimization of a nonlinear objective subject to "box constraints"
(meaning that some of the parameters can have upper and/or lower
bounds).  Actually it is simpler than that: some of the parameters
must be positive.  We do not provide gradient evaluations.  I once*
worked out the gradient of the criterion (I think it was the best
mathematics I ever did) and then found that it ended up slowing the
optimization to a crawl in the difficult cases.  A bit of reflection
showed that each evaluation of the gradient could be hundreds or
thousands of times more expensive than an evaluation of the objective
itself, so you might as well use a gradient-free method and just do
more function evaluations.  In many cases you know several points at
which you will be evaluating the objective, so you could split those
evaluations off into different threads, as sketched below.
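
As a sketch of what such parallel objective evaluations might look
like, assuming a version of lme4 that exposes the profiled deviance as
a function of the covariance parameters (current lme4 does this with
devFunOnly = TRUE), and a Unix-alike for mclapply; the candidate theta
values here are made up:

library(lme4)
library(parallel)
## profiled deviance as a function of theta (relative covariance parameters)
devfun <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
               devFunOnly = TRUE)
## several candidate theta vectors an optimizer might propose (made-up values)
thetas <- list(c(1, 0, 1), c(0.8, 0.1, 0.8), c(1.2, 0, 1.2))
## evaluate the objective at all candidates in parallel (mclapply is not
## available on Windows with mc.cores > 1)
unlist(mclapply(thetas, devfun, mc.cores = 2))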

I don't know of such a multithreaded optimizer (many of the optimizers
that I find are still being written in Fortran 77, God help us), but
that would be my best bet if one could be found.  However, I am saying
this without having done the profiling of the calculation myself, so
that is still a guess.

* Bates and DebRoy, "Linear mixed models and penalized least squares",
Journal of Multivariate Analysis, 91 (2004) 1-17



