[R-sig-ME] I'm sorry, and here is what I mean to ask about speed

Douglas Bates bates at stat.wisc.edu
Fri Sep 25 04:46:48 CEST 2009


On Thu, Sep 24, 2009 at 11:29 AM, Doran, Harold <HDoran at air.org> wrote:
>> I'm sorry I made Doug mad and I'm sorry to have led the
>> discussion off into such a strange, disagreeable place.
>
> I don't think you did. He invests a lot in the lmer function and the
> Matrix package (with Martin) and likes to see them represented properly.
>
>> Now that I understand your answers, I believe I can ask the
>> question in a non-naive way.  I believe this version should
>> not provoke some of the harsh words that I triggered in my
>> awkward question.
>>
>> New Non-Naive version of the Speed Question
>>
>> Do you have a copy of HLM6 on your system?  Maybe you could
>> help me by running the same model in R (with any of the
>> packages such as lme4, nlme, or whatever) and HLM6 and let us
>> all know if you get similar estimates and how long it takes
>> to run each one.
>
> Yes, I do, and I have. But I *think* you can only time HLM with a wall
> clock. Below is an example of a model that you can run in both
> packages. HLM takes a bit longer (judging by a wall clock) on my
> machine to run this same model. These data come free with HLM, in the
> files EG1, EG2, and EG3.
>
> library(mlmRev)
>> system.time(fm1 <- lmer(math ~ year + retained + black + hispanic +
> size + lowinc + mobility + (year | schoolid) + (year | childid), egsingle))
>   user  system elapsed
>   7.75    0.17    7.92
>
>>
>> Here's why I ask.
>>
>> My colleague has HLM6 on Windows XP and he compared a
>> two-level linear mixed effects model fitted with lmer from
>> lme4 against HLM6.  He surprised me by claiming that the HLM6
>> model estimation was completed in about 1.5 seconds and the
>> lmer estimation took 50 seconds.  That did not seem right to
>> me.  I looked a bit at his example and made a few mental
>> notes so I could ask you what to look for when I go back to
>> dig into this.  There are 27000 cases in his datasets and he has about
>> 25 variables at the lower level of observation and 4 or 5
>> variables at the higher level, which I think is the county of
>> survey respondents.
>> He is fitting a random intercept (random across counties) and
>> several random slopes for the higher level variables.
>
> Like I said yesterday, examples like this aren't too interesting. I
> don't really care if one program can do something in 1.5 seconds
> whereas another takes 50 seconds. Plus, this is a very simple model with
> a nested design. There are many, many software programs for these kinds
> of models. I think differences such as those are petty. What *is*
> interesting is when the models become complex, like those with many
> levels of random effects and complex covariance structures. At that
> point you will find that HLM cannot even estimate those models. Doug and
> I have recently estimated very large models where the dimension of the
> model matrix for the random effects was huge and the run time was on the
> order of a few hours. Now, take a model with crossed random effects for,
> say, 1 million students, each of whom has three observations. Now try to
> run that model with random effects for students and their teachers in
> each of those years in HLM and see what happens.
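
(A minimal sketch of the kind of crossed-effects call Harold
describes, with a made-up data frame and made-up variable names;
crossing is expressed simply by listing separate random-effects terms:

    ## hypothetical data: one row per student-by-year observation,
    ## 'studid' with ~1 million levels, 'teachid' the teacher that year
    fm <- lmer(score ~ year + (1 | studid) + (1 | teachid), data = scores)

No nesting needs to be declared; the two grouping factors are simply
crossed.)
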
>>
>> He pointed out that the MLwiN website reported speed differences in
>> 2006 that were about the same.  Of course, you and I know
>> that R and all of the mixed effects packages have improved
>> significantly since then. That is why the speed gap on the
>> one Windows XP system surprised me.
>>
>> Can you tell me if you see a difference between the two
>> programs (if you have HLM6)?  If you see a difference of the
>> same magnitude, it may mean we are not mistaken in our
>> conclusion.  But if you see no difference, then it will mean
>> I'm getting it wrong and I should investigate more. If I
>> can't solve it, I should provide a reproducible example for
>> your inspection.  I will ask permission to release the
>> private data to you in that case.
>
> See above for the example. But, despite my better judgement, I gave the
> example anyhow. I don't find differences on the order of a minute or so
> interesting at all. The more meaningful differences appear when the
> models become complex.
>
>>
>> Perhaps you think there are good reasons why R estimation
>> takes longer.  E.g.:
>
> But I don't think it does.
>
>> 1. HLM programmers have full access to benefit from
>> optimizations in lmer and other open programs, but they don't
>> share their optimizations in return.
>
> I believe Richard Congdon and Steve Raudenbush to be very virtuous and
> have published quite a bit on this topic. There may be some
> computational details that remain unknown, but they have been very
> transparent in their work.
>
>> 2. lmer and other R routines are making calculations in a
>> better way, a more accurate way, so we should not worry that
>> they take longer.
>
> While I am a big user of the lmer functions, I don't necessarily believe
> this is accurate. Doug has taken great care to ensure lmer returns
> reliable estimates. But so has the HLM crew. I have routinely found
> that HLM and lmer give the same estimates for models with nested random
> effects. Since HLM cannot estimate large models with crossed random
> effects, I cannot form any comparison.
>
>>    That was my first guess; in the original mail I said I
>> thought that HLM was using PQL whereas lmer uses the Laplace
>> approximation or adaptive Gaussian quadrature.  But Doug's comment
>> indicated that I was mistaken to expect a difference there,
>> because REML is the default in lmer and it is also what HLM
>> is doing, and there is no quadrature or integral approximation
>> involved in a linear mixed model (Gaussian dependent variable).
>
> I think you remain a little confused here. For generalized linear mixed
> models, R and HLM have different ways of evaluating the
> integral--although I have found that, with nested designs, both yield
> results that agree out to multiple decimal places. But with linear
> mixed models that is not an issue. With linear mixed models, REML is the
> default in both HLM and R. But the HLM folks view the world as a GLS
> problem and Doug views it as a PLS problem.

That's true.  I think there are additional differences in the behavior
at edge cases and in the profiling of the objective function (the
deviance or the REML criterion).  As you mentioned, the algorithm
implemented in lmer uses penalized least squares (PLS) instead of
generalized least squares (GLS).  In mathematical terms these are
"dual" problems: a GLS problem can be written as PLS and a PLS problem
can be written as GLS.  In a strictly hierarchical model there is not
a clear advantage one way or the other.  In a model with non-nested
random effects PLS is the winner, big-time; you don't want to
contemplate GLS in those cases.
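
To make the duality concrete, in textbook notation (the usual symbols,
not the ones used internally in lme4): with y = X\beta + Zb + e,
b ~ N(0, G) and e ~ N(0, \sigma^2 I), GLS estimates \beta through the
marginal covariance,

    \hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} y,   V = Z G Z^T + \sigma^2 I,

while PLS solves jointly for \beta and b with a penalty on the size of
the random effects,

    \min_{\beta, b} \|y - X\beta - Zb\|^2 + \sigma^2 b^T G^{-1} b .

The estimates coincide; which form is cheaper to compute with depends
on the structure of Z.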

There is a second duality where the choice made in lmer differs from
the conventional choice.  If you have seen Henderson's mixed-model
equations (http://en.wikipedia.org/wiki/Mixed_model), the conventional
representation uses the precision matrix of the random effects (G^{-1}
in the Wikipedia entry; a precision matrix is the inverse of a
variance-covariance matrix).  That's fine except when G becomes
singular, as it can.  The representation in lmer incorporates the
effects of G into the model matrix, not the penalty.  A detailed
explanation is given in the slides from a presentation at DSC 2009,
available at
http://matrix.r-forge.r-project.org/slides/2009-07-14-Copenhagen/BatesMaechlerD.pdf
These slides include an example from you, Harold.
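
Schematically, the conventional (Henderson) form solves

    X^T X \beta + X^T Z b                     = X^T y
    Z^T X \beta + (Z^T Z + \sigma^2 G^{-1}) b = Z^T y

so the equations break down when G is singular.  Writing b = \Lambda u
with G = \sigma^2 \Lambda \Lambda^T and penalizing \|u\|^2 instead
pushes \Lambda into the model matrix (Z becomes Z\Lambda) and stays
well defined even when \Lambda, and hence G, is singular.  (This is
only a sketch of the idea; the slides linked above give the precise
formulation.)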

>>
>> On the other hand, perhaps you are (like me) surprised by
>> this difference and you want to help me figure out the cause
>> of the differences.  If you have ideas about that, maybe we
>> can work together (I don't suck at C!). I have quite a bit of
>> experience profiling programs in C and helped with some
>> optimization on a big-ish C++-based R package this summer.
>>
>> So far, I have a simple observer's interest in this question.   I
>> advise people whether it is beneficial for them to spend
>> their scarce resources on a commercial package like HLM6 and
>> one of the factors
>> that is important to them is how "fast" the programs are.   I
>> personally don't see an urgent reason to buy HLM because it
>> can estimate a model in 1 second and an open source approach
>> requires 50 seconds.  But I'm not the one making the
>> decision. If I can make the R version run almost as fast as
>> HLM6, or provide reasons why people might benefit from using
>> a program that takes longer, then I can do my job of advising
>> the users.
>
> Yes, "fast" is important when it comes to big problems. But, there are
> many other issues to consider as well, and I would place those far above
> the "fast" issue. For instance, what if you want to create an
> interaction term in HLM? You must first work with a stat package to
> manually create the interaction variable, import those data into HLM and
> then run the model. So, it takes two software programs to get the job
> done. What if you want presentation-style visual displays of your data?
> What if you want (close to) on-demand support from real statisticians on
> your problem?
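
For example, with the egsingle data used above, an interaction is
requested directly in the model formula (just a sketch, not a model
anyone is proposing):

    ## '*' expands to year + lowinc + year:lowinc, so no hand-built
    ## product column is needed
    fm2 <- lmer(math ~ year * lowinc + (year | schoolid), egsingle)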

It is interesting that in Paul's original posting on this he described
the problematic example as having 40 first-level "variables" and 5
second-level variables (or the other way around; I can never remember
whether level 1 refers to the most specific grouping, usually an
individual, or the least specific grouping, such as district if the
data are grouped by student, class, school and district).  I'm willing
to bet that these aren't variables or covariates in the sense that I
would think of them.  If you had a race/ethnicity covariate with
(typically in the U.S.) five levels then that would count as 4
variables in HLM, I believe, but only 1 variable in R.  The formula
language in R creates the model matrix from the data and the terms in
the model.  Other systems require that you create the model matrix
yourself, which leads people to think it is natural to use 0/1
encodings for characteristics like sex.  For a binary classification
that is not too risky, but when you have more than two levels you must
create a set of indicators and keep them consistent, along with the
implicit "missing" indicator.  It is just much more awkward, and yet,
if you start out using such a system, you find it to be the natural
way of doing things (sort of like people who grew up using the
Imperial system of weights and measures: feet, yards, furlongs, miles,
etc. complaining that the metric system is so difficult to understand).
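
A two-line illustration of the encoding point (the factor here is made
up for the example):

    eth <- factor(c("Black", "Hispanic", "White", "Asian", "Other"))
    model.matrix(~ eth)   ## an intercept plus 4 indicator columns, not 5

The indicator columns are generated, and kept consistent, by R's
contrasts machinery rather than by hand.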

> Last, and IMHO the biggest factor: what if you want to modify the code
> in some way to customize your analysis? I suppose all of this is to say
> look beyond the "fast" issue for simple problems in your evaluation
> criteria. I would suggest that you consider what your world would look
> like if you needed any of the things I discuss in the preceding
> paragraph.
>
> Just my $0.02.
>
>
>
>>
>> I am sorry if this question appears impertinent or insulting.
>> I do not mean it as a criticism.
>>
>> --
>> Paul E. Johnson
>> Professor, Political Science
>> 1541 Lilac Lane, Room 504
>> University of Kansas
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>




More information about the R-sig-mixed-models mailing list