[R-sig-ME] I'm sorry, and here is what I mean to ask about speed

Steven McKinney smckinney at bccrc.ca
Thu Sep 24 04:31:21 CEST 2009



Hi Paul,

Comments on speed in-line below


> -----Original Message-----
> From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-
> models-bounces at r-project.org] On Behalf Of Paul Johnson
> Sent: Wednesday, September 23, 2009 6:43 PM
> To: r-sig-mixed-models at r-project.org
> Subject: [R-sig-ME] I'm sorry, and here is what I mean to ask about
> speed
> 
> I'm sorry I made Doug mad and I'm sorry to have led the discussion off
> into such a strange, disagreeable place.
> 
> Now that I understand your answers, I believe I can ask the question
> in a non-naive way.  I believe this version will not provoke the harsh
> responses that my awkward first question did.
> 
> New Non-Naive version of the Speed Question
> 
> Do you have a copy of HLM6 on your system?  Maybe you could help me by
> running the same model in R (with any of the packages such as lme4,
> nlme, or whatever) and HLM6 and let us all know if you get similar
> estimates and how long it takes to run each one.
> 
> Here's why I ask.
...
> 
> Perhaps you think there are good reasons why R estimation takes longer.
> E.g.:
> 1. HLM programmers have full access to, and can benefit from, the
> optimizations in lmer and other open programs, but they don't share
> their own optimizations in return.
> 2. lmer and other R routines do their calculations in a better, more
> accurate way, so we should not worry that they take longer.
>    That was my first guess: in the original mail I said I thought that
> HLM was using PQL whereas lmer uses the Laplace approximation or
> adaptive Gaussian quadrature.  But Doug's comment indicated that I was
> mistaken to expect a difference there, because REML is the default in
> lmer and is also what HLM does, and no quadrature or integral
> approximation is involved in a linear mixed model (Gaussian dependent
> variable).
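
For the Gaussian case Paul describes, that's right: lmer() fits by REML
by default and switches to maximum likelihood with REML = FALSE, and no
quadrature is involved either way.  A minimal sketch to make the point
concrete, using the sleepstudy data that ships with lme4 (the model is
just the standard lme4 example, chosen for illustration):

library(lme4)

## REML is the default criterion for a Gaussian linear mixed model;
## REML = FALSE switches to maximum likelihood.  Neither fit needs any
## integral approximation.
fm_reml <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fm_ml   <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)

## Fixed-effect estimates under the two criteria
cbind(REML = fixef(fm_reml), ML = fixef(fm_ml))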

Here's Doug's comment:
<Doug Bates>
But you don't need to speculate about what lmer does.  It is Open
Source so you can check for yourself.

However, this does bring up another point, which is the need to compare
apples with apples when you are benchmarking software.  If the data
import and model specification stages in HLM6 create the necessary
matrix structures for performing the iterative fit, then does the time
to fit the model consist solely of the optimization and summary
stages?  Using an expression like

system.time(fm1 <- lmer(...))

is assessing the time to take the original data, which could be in a
very general form, create all those internal structures and perform
the optimization.

You should bear in mind that a lot of that construction of the model
structures is written in R exactly so that it is capable of fitting
very general model specifications.  The code in HLM6 is, I imagine,
compiled code, which is possible because it targets a very specific
task, and compiled code is always going to be much faster than
interpreted code.
</Doug Bates>

So part of the speed difference will be that R is an interpreted
language, whereas HLM6 is compiled.  

The other part is the construction and handling of the model matrix, 
which is a tough one to compare, as lmer() can handle more general models.  
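
One rough way to see how that split plays out inside lmer() itself is to
time the full call and then time a refit that reuses the already-built
model structures.  A sketch, again with the sleepstudy data from lme4
(refit() keeps the structures and only re-runs the optimization, so its
timing only approximates the "optimization and summary" part Doug
mentions; treat the numbers as illustrative, not as a benchmark of HLM6):

library(lme4)

## Full pipeline: build the model structures *and* optimize
full <- system.time(fm1 <- lmer(Reaction ~ Days + (Days | Subject),
                                sleepstudy))

## Roughly the optimization step alone: refit the existing model object
## with the same response, reusing the structures built above
opt_only <- system.time(fm2 <- refit(fm1, sleepstudy$Reaction))

rbind(full.pipeline = full, optimization.only = opt_only)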

Will your colleague only be fitting models that are within the 
specifications of HLM6, or will there be some datasets with structure 
that HLM6 cannot handle, so that the data will have to be shoe-horned 
into HLM6 with compromises that would not be needed in lmer()?

If the former, then some clever programming (potentially both in R and
in C) can yield a specialized version of lmer() that will be comparable
in speed to HLM6.  (I've made such modifications to several functions
over the years, so I have stopped believing that compiled code is always
faster than R; after all, much of R is compiled C.)

If the latter, then the flexibility of the interpreted-language version,
and the speed of implementation (i.e., versus recoding and recompiling a
specialized C program for each new scenario), generally beat the
compiled-language version.
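
Either way, before anyone writes specialized compiled code it is worth
profiling a fit to see where the time actually goes.  A sketch using R's
built-in profiler (same illustrative sleepstudy model as above):

library(lme4)

## Profile one fit; the output shows how much time goes into building the
## model structures versus the optimizer's objective-function evaluations
Rprof("lmer_profile.out")
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
Rprof(NULL)

## Time spent, aggregated by function
head(summaryRprof("lmer_profile.out")$by.total, 10)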

> 
> On the other hand, perhaps you are (like me) surprised by this
> difference and want to help me figure out its cause.  If you have ideas
> about that, maybe we can work together (I don't suck at C!).  I have
> quite a bit of experience profiling programs in C and helped optimize a
> big-ish C++-based R package this summer.
> 
> So far, I have a simple observer's interest in this question.  I advise
> people on whether it is worth spending their scarce resources on a
> commercial package like HLM6, and one of the factors that matters to
> them is how "fast" the programs are.  I personally don't see an urgent
> reason to buy HLM just because it can estimate a model in 1 second
> while an open source approach requires 50 seconds.


When I need to fit hundreds or thousands of models, I overcome the
speed deficit of the interpreted language by using a compute cluster.
That is far cheaper than the programmer time it would take to code and
compile specialty software to handle the great variety of problems
that the interpreted language can handle.
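
A small sketch of that strategy on a single multicore machine, with the
parallel package standing in for a real cluster (mclapply forks on
unix-alikes; on Windows you would use parLapply with a socket cluster).
The workload is a hypothetical parametric bootstrap that refits the same
illustrative model a few hundred times; the 500 replicates and 4 cores
are arbitrary choices:

library(lme4)
library(parallel)

## One fit of the model of interest
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

## Simulate 500 response vectors from the fitted model, then refit in
## parallel across 4 cores
ysim  <- simulate(fm, nsim = 500)
boots <- mclapply(ysim, function(y) fixef(refit(fm, y)), mc.cores = 4)

## 95% percentile bootstrap intervals for the fixed effects
apply(do.call(rbind, boots), 2, quantile, c(0.025, 0.975))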


All that said, the learning curve for the S language is somewhat
steep and a bit long.  If you just have to stuff some data into
something right away and get some numbers out, the $500 or so to
purchase HLM6 may be cheaper than learning R.  But if you're in it
for the long haul, learning how to drive this Race caR is sweet.


Steven McKinney, Ph.D.

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: smckinney -at-  bccrc +dot+ ca
tel: 604-675-8000 x7561

BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C.
V5Z 1L3

Canada

> But I'm not the one making the decision. If I can make the R
> version run almost as fast as HLM6, or provide reasons why people
> might benefit from using a program that takes longer, then I can do my
> job of advising the users.
> 
> I am sorry if this question appears impertinent or insulting. I do not
> mean it as a criticism.
> 
> --
> Paul E. Johnson
> Professor, Political Science
> 1541 Lilac Lane, Room 504
> University of Kansas
> 
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models



