[R-sig-ME] I'm sorry, and here is what I mean to ask about speed

Doran, Harold HDoran at air.org
Thu Sep 24 18:29:59 CEST 2009


> I'm sorry I made Doug mad and I'm sorry to have led the 
> discussion off into such a strange, disagreeable place.

I don't think you did. He has invested a lot in the lmer function and
the Matrix package (with Martin) and likes to see them represented
properly.

> Now that I understand your answers, I believe I can ask the 
> question in a non-naive way.  I believe this version should 
> not provoke some of the harsh words that I triggered in my 
> awkward question.
> 
> New Non-Naive version of the Speed Question
> 
> Do you have a copy of HLM6 on your system?  Maybe you could 
> help me by running the same model in R (with any of the 
> packages such as lme4, nlme, or whatever) and HLM6 and let us 
> all know if you get similar estimates and how long it takes 
> to run each one.

Yes, I do, and I have. But I *think* you can only time HLM with a wall
clock. Below is an example of a model that you can run in both packages;
HLM takes a bit longer (judged by the wall clock) on my machine to run
this same model. The data ship freely with HLM in the files EG1, EG2,
and EG3.

> library(mlmRev)  # depends on lme4; provides the egsingle data
> system.time(fm1 <- lmer(math ~ year + retained + black + hispanic +
+     size + lowinc + mobility + (year | schoolid) + (year | childid),
+     egsingle))
   user  system elapsed 
   7.75    0.17    7.92 

> 
> Here's why I ask.
> 
> My colleague has HLM6 on Windows XP and he compared a 
> two-level linear mixed effects model fitted with lmer from 
> lme4 against HLM6.  He surprised me by claiming that the HLM6 
> model estimation was completed in about 1.5 seconds and the 
> lmer estimation took 50 seconds.  That did not seem right to 
> me.  I looked a bit at his example and made a few mental 
> notes so I could ask you what to look for when I go back to 
> dig into this.  There are 27000 cases in his dataset and he has about
> 25 variables at the lower level of observation and 4 or 5 
> variables at the higher level, which I think is the county of 
> survey respondents.
> He is fitting a random intercept (random across counties) and 
> several random slopes for the higher level variables.

Like I said yesterday, examples like this aren't too interesting. I
don't really care whether one program can do something in 1.5 seconds
while another takes 50 seconds. Besides, this is a very simple model
with a nested design, and there are many, many software programs for
these kinds of models; differences like those are petty. What *is*
interesting is when the models become complex, like those with many
levels of random effects and complex covariance structures. At that
point you will find that HLM cannot even estimate the models. Doug and
I have recently estimated very large models where the dimension of the
model matrix for the random effects was huge and the run time was on
the order of a few hours. Now take a model with crossed random effects
for, say, 1 million students, each of whom has three observations. Try
running that model, with random effects for students and for their
teachers in each of those years, in HLM and see what happens.
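For concreteness, here is a minimal sketch of the kind of crossed
random-effects specification I mean. The data are simulated, and the
sizes and variable names (student, teacher, score) are made up for
illustration; the real problem would scale the sizes up by orders of
magnitude.

library(lme4)

## Toy data: students crossed with (not nested in) teachers.
set.seed(1)
n_students <- 1000
n_teachers <- 50
d <- data.frame(
  student = factor(rep(seq_len(n_students), each = 3)),
  teacher = factor(sample(n_teachers, n_students * 3, replace = TRUE)),
  year    = rep(0:2, n_students)
)
d$score <- 250 + 10 * d$year +
  rnorm(n_students, 0, 5)[d$student] +  # student effects
  rnorm(n_teachers, 0, 3)[d$teacher] +  # teacher effects
  rnorm(nrow(d), 0, 8)                  # residual noise

## Crossed random effects; lmer handles these directly.
fm <- lmer(score ~ year + (1 | student) + (1 | teacher), data = d)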
> 
> He pointed out that the MLwiN website reported speed differences in
> 2006 that were about the same.  Of course, you and I know 
> that R and all of the mixed effects packages have improved 
> significantly since then. That is why the speed gap on the 
> one Windows XP system surprised me.
> 
> Can you tell me if you see a difference between the two 
> programs (if you have HLM6)?  If you see a difference of the 
> same magnitude, it may mean we are not mistaken in our 
> conclusion.  But if you see no difference, then it will mean 
> I'm getting it wrong and I should investigate more. If I 
> can't solve it, I should provide a reproducible example for 
> your inspection.  I will ask permission to release the 
> private data to you in that case.

See above: despite my better judgement, I gave the example anyhow. I
don't see differences on the order of a minute or so as interesting at
all. The more meaningful differences appear when the models become
complex.

> 
> Perhaps you think there are good reasons why R estimation 
> takes longer.  E.g.:

But I don't think it does.

> 1. HLM programmers have full access to benefit from 
> optimizations in lmer and other open programs, but they don't 
> share their optimizations in return.

I believe Richard Congdon and Steve Raudenbush to be very virtuous, and
they have published quite a bit on this topic. There may be some
computational details that remain unpublished, but they have been very
transparent in their work.

> 2. lmer and other R routines are making calculations in a 
> better way, a more accurate way, so we should not worry that 
> they take longer.

While I am a big user of lmer, I don't necessarily believe this is
accurate. Doug has taken great care to ensure that lmer returns
reliable estimates, but so have the HLM crew. I have routinely found
that HLM and lmer give the same estimates for models with nested random
effects. Since HLM cannot estimate large models with crossed random
effects, I cannot make that comparison.
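If you want to check that agreement yourself, the pieces to put side by
side with HLM's output are easy to pull from a fit. A minimal sketch,
using fm1 from the example above:

fixef(fm1)    # fixed-effect estimates (HLM's gammas)
VarCorr(fm1)  # variance components (HLM's tau and sigma^2)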

>    That was my first guess; in the original mail I said I 
> thought that HLM was using PQL whereas lmer uses the Laplace 
> approximation or adaptive Gaussian quadrature.  But Doug's 
> comment indicated that I was mistaken to expect a difference 
> there, because REML is the default in lmer and it is also what 
> HLM is doing, and there's no involvement of quadrature or 
> integral approximation in a linear mixed model (gaussian 
> dependent variable).

I think you remain a little confused here. For generalized linear mixed
models, R and HLM do have different ways of evaluating the
integral--although I have found that, with nested designs, both yield
results that agree out to multiple decimal places. But with linear
mixed models that is not an issue: REML is the default in both HLM and
R. The difference is that the HLM folks view the problem as generalized
least squares (GLS) while Doug views it as penalized least squares
(PLS).
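To make the REML point concrete, here is a minimal sketch of how the
criterion is chosen in lmer, reusing a simplified version of the
egsingle model from above (REML = TRUE is the default):

## REML, the default, matching HLM's default:
fm_reml <- lmer(math ~ year + (year | schoolid), egsingle)

## Maximum likelihood, for comparison:
fm_ml <- lmer(math ~ year + (year | schoolid), egsingle, REML = FALSE)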

> 
> On the other hand, perhaps you are (like me) surprised by 
> this difference and you want to help me figure out the cause 
> of the differences.  If you have ideas about that, maybe we 
> can work together (I don't suck at C!). I have quite a bit of 
> experience profiling programs in C and helped optimize a 
> big-ish C++ based R package this summer.
> 
> So far, I have a simple observer's interest in this question.  I 
> advise people whether it is beneficial for them to spend 
> their scarce resources on a commercial package like HLM6, and 
> one of the factors that is important to them is how "fast" the 
> programs are.  I personally don't see an urgent reason to buy 
> HLM because it can estimate a model in 1 second and an open 
> source approach requires 50 seconds.  But I'm not the one 
> making the decision. If I can make the R version run almost as 
> fast as HLM6, or provide reasons why people might benefit from 
> using a program that takes longer, then I can do my job of 
> advising the users.

Yes, "fast" is important when it comes to big problems. But, there are
many other issues to consider as well, and I would place those far above
the "fast" issue. For instance, what if you want to create an
interaction term in HLM? You must first, work with a stat package to
manually create the interaction variable, import those data into HLM and
then run the model. So, it takes two software programs to get the job
done. What if you want presentation-style visual displays of your data?
What if you want (close to) on-demand support from real statisticians on
your problem? 
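In R, by contrast, the interaction lives in the model formula, so no
second program is needed. A minimal sketch reusing variables from the
egsingle example above:

## black * lowinc expands to both main effects plus their interaction;
## no pre-computed interaction column is required.
fm_int <- lmer(math ~ year + black * lowinc + (year | schoolid),
               egsingle)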

Last, and IMHO the biggest factor: what if you want to modify the code
in some way to customize your analysis? I suppose all of this is to
say, look beyond speed on simple problems in your evaluation criteria.
I would suggest that you consider what your world would look like if
you needed any of the things I discuss in the preceding paragraphs.

Just my $0.02.



> 
> I am sorry if this question appears impertinent or insulting. 
> I do not mean it as a criticism.
> 
> --
> Paul E. Johnson
> Professor, Political Science
> 1541 Lilac Lane, Room 504
> University of Kansas
> 