[R-sig-ME] multicore lmer: any changes since 2007?

Douglas Bates bates at stat.wisc.edu
Fri Mar 16 15:45:53 CET 2012


On Fri, Mar 16, 2012 at 7:32 AM, Mike Lawrence <Mike.Lawrence at dal.ca> wrote:
> Apologies for resurrecting a months-old thread, but I thought I'd note
> that DEoptim has now been made parallel:
>
> http://blog.fosstrading.com/2012/03/deoptim-in-parallel.html
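
(For reference, enabling that looks roughly like the following, assuming
the parallelType argument added to DEoptim.control around that time; the
objective and bounds below are placeholders, not the lmer deviance:)

    library(DEoptim)
    ## a toy objective with box constraints, standing in for the real deviance
    obj <- function(x) sum((x - 1)^2)
    DEoptim(obj, lower = c(0, 0), upper = c(5, 5),
            control = DEoptim.control(itermax = 50, parallelType = 1))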

Before trying any of the options for speeding up execution, one should
first profile the execution ("profile" in the sense of Rprof for the R
code and a profiler for the underlying compiled code).  My guess is
that the calculations in the optimizer algorithm are not the
bottleneck; it is the evaluation of the deviance that takes the time.
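
For example, a minimal sketch of the R-level profiling (the model and
the sleepstudy data shipped with lme4 are purely illustrative; the
compiled code needs an external profiler such as the one mentioned
below):

    library(lme4)
    Rprof("lmer-profile.out")           # start the R profiler
    fit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
    Rprof(NULL)                         # stop profiling
    summaryRprof("lmer-profile.out")$by.self   # where the R-level time went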

Recently I switched to using the Google perftools
(http://code.google.com/p/gperftools/) implementation of malloc/free to
get a better handle on potential memory errors during development.  It
should be possible to use the perftools CPU profiler on the compiled
code too.

> On Wed, Jul 6, 2011 at 2:40 PM, Ben Bolker <bbolker at gmail.com> wrote:
>> On 07/06/2011 11:40 AM, Mike Lawrence wrote:
>>> Thanks for the detailed reply.
>>>
>>> According to the comment by Joshua Ulrich (one of the DEoptim
>>> developers) on this stackoverflow post
>>> (http://stackoverflow.com/questions/3759878/parallel-optimization-in-r),
>>> it seems that DEoptim might be a parallelizable optimizer, and as I
>>> recall you can put box constraints on parameters with DEoptim. I just
>>> sent off an email to the DEoptim developers to see if there's been any
>>> progress on the parallel front.
>>>
>>> Mike
>>
>>  I have used differential evolution in the past (though not in
>> decades (!!)), though not via the DEoptim() package, but I don't think
>> DEoptim() will be appropriate for this purpose.  I'm actually not
>> entirely clear on what DB means by "parallel evaluation of the objective
>> function".  In the simplest derivative-free case for example (the
>> Nelder-Mead simplex), it's hard to see how one could evaluate the
>> objective function in parallel because each evaluation changes the
>> structure of the simplex and determines where the next evaluation should
>> be.  A very quick look at BOBYQA (source code in the minqa package, or
>> formal description at
>> <http://www.damtp.cam.ac.uk/user/na/NA_papers/NA2009_06.pdf>) suggests
>> the same one-point-at-a-time updating scheme.
>>
>>  But DB says
>>
>>>> In many cases you know several points where you will be
>>>> evaluating the objective so you could split those
>>>> off into different threads.
>>
>>  Since he has (probably literally) forgotten more about numerical
>> computation than I ever knew, he's probably right, but I don't know of
>> those examples.
>>
>>  Interesting discussion ...
>>
>>
>>>
>>> On Wed, Jul 6, 2011 at 11:59 AM, Douglas Bates <bates at stat.wisc.edu> wrote:
>>>> On Tue, Jul 5, 2011 at 4:52 PM, Mike Lawrence <Mike.Lawrence at dal.ca> wrote:
>>>>> Back in 2007 (http://tolstoy.newcastle.edu.au/R/e2/help/07/05/17777.html)
>>>>> Dr. Bates suggested that using a multithreaded BLAS was the only
>>>>> option for speeding lmer computations on multicore machines (and even
>>>>> then, it might even cause a slow down under some circumstances).
>>>>>
>>>>> Is this advice still current, or have other means of speeding lmer
>>>>> computations on multicore machines arisen in more recent years?
>>>>
>>>> As always, the problem with trying to parallelize a particular
>>>> calculation is to determine how and when to start more than one
>>>> thread.
>>>>
>>>> After the setup stage, the calculations in fitting an lmer model
>>>> involve optimizing the profiled deviance or profiled REML criterion.
>>>> Each evaluation of the criterion involves updating the components of
>>>> the relative covariance factor, updating the sparse Cholesky
>>>> decomposition and solving a couple of systems of equations involving
>>>> the sparse Cholesky factor.
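
As a concrete illustration of the pieces involved, more recent versions
of lme4 expose them through getME(); a sketch, again using the
sleepstudy data purely as a stand-in:

    library(lme4)
    fit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
    getME(fit, "theta")    # covariance parameters the optimizer varies
    getME(fit, "Lambdat")  # transposed relative covariance factor (sparse)
    getME(fit, "L")        # sparse Cholesky factor (a CHMfactor object)
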
>>>>
>>>> There are a couple of calculations involving dense matrices but in
>>>> most cases the time spent on them is negligible relative to the
>>>> calculations involving the sparse matrices.
>>>>
>>>> A multithreaded BLAS will only help speed up the calculations on dense
>>>> matrices.  The "supernodal" form of the Cholesky factorization can use
>>>> the BLAS for some calculations but usually on small blocks.  Most of
>>>> the time the software chooses the "simplicial" form of the
>>>> factorization because the supernodal form would not be efficient and
>>>> the simplicial form doesn't use the BLAS at all.  Even if the
>>>> supernodal form is chosen, the block sizes are usually small and a
>>>> multithreaded BLAS can actually slow down operations on small blocks
>>>> because the communication and synchronization overhead cancels out any
>>>> gain from using multiple cores.
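
To see the two forms side by side, one can ask the Matrix package for
each factorization explicitly (a sketch; the matrix here is a random
stand-in, and CHOLMOD normally chooses the form itself from the nonzero
pattern):

    library(Matrix)
    ## a small sparse symmetric positive-definite matrix, just for illustration
    A  <- forceSymmetric(crossprod(rsparsematrix(200, 200, density = 0.02)) +
                         Diagonal(200))
    cs <- Cholesky(A, super = FALSE)  # simplicial form: no BLAS calls
    cS <- Cholesky(A, super = TRUE)   # supernodal form: dense BLAS on blocks
    c(class(cs), class(cS))           # "dCHMsimpl" vs "dCHMsuper"
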
>>>>
>>>> Of course, your mileage may vary and only by profiling both the R code
>>>> and the compiled code will you be able to determine how things could
>>>> be sped up.
>>>>
>>>> If I had to guess, I would say that the best hope for parallelizing
>>>> the computation would be to find an optimizer that allows for parallel
>>>> evaluation of the objective function.  The lme4 package requires
>>>> optimization of a nonlinear objective subject to "box constraints"
>>>> (meaning that some of the parameters can have upper and/or lower
>>>> bounds).  Actually it is simpler than that: some of the parameters
>>>> must be positive.  We do not provide gradient evaluations.  I once*
>>>> worked out the gradient of the criterion (I think it was the best
>>>> mathematics I ever did) and then found that it ended up slowing the
>>>> optimization to a crawl in the difficult cases.  A bit of reflection
>>>> showed that each evaluation of the gradient could be hundreds or
>>>> thousands of times more complex than an evaluation of the objective
>>>> itself, so you might as well use a gradient-free method and just do
>>>> more function evaluations.  In many cases you know several points
>>>> where you will be evaluating the objective so you could split those
>>>> off into different threads.
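
A rough sketch of that idea with more recent versions of lme4, which can
return the profiled deviance as a function via the devFunOnly argument,
using parallel::mclapply to evaluate it at several candidate values of
theta at once (the theta values below are arbitrary; a real optimizer
would choose them):

    library(lme4)
    library(parallel)
    ## the profiled deviance as a function of the covariance parameters theta
    devfun <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
                   devFunOnly = TRUE)
    ## a few candidate points in the parameter space (illustrative only)
    thetas <- list(c(1, 0, 1), c(0.5, 0, 0.5), c(1.5, 0.1, 1))
    ## evaluate the objective at all of them in separate processes
    unlist(mclapply(thetas, devfun, mc.cores = 2))  # forks; not on Windows
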
>>>>
>>>> I don't know of such a multithreaded optimizer (many of the optimizers
>>>> that I find are still being written in Fortran 77, God help us) but
>>>> that would be my best bet if one could be found.  However, I am saying
>>>> this without having done the profiling of the calculation myself so
>>>> that is still a guess.
>>>>
>>>> * Bates and DebRoy, "Linear mixed models and penalized least squares",
>>>> Journal of Multivariate Analysis, 91 (2004) 1-17