[R-sig-ME] More naive questions: Speed comparisons? what is a "stack imbalance" in lmer? does lmer center variables?

Wed Sep 23 21:37:24 CEST 2009

On Wed, Sep 23, 2009 at 1:46 PM, Kevin Wright <kw.stat at gmail.com> wrote:
> Of course asreml is "not part of R", but it is certainly available in
> R.  R's license allows for closed-source packages, just not on CRAN.
> To call this "dishonest" is most peculiar.  Is REvolutionR acting
> dishonestly with some of their offerings?

> I'm a strong believer in collaborative development and open source,
> but I believe there's room for closed development models too.  More
> than that, I would even argue that it is *helpful* to R.  Remember
> that MASS, Design, Hmisc, nlme, and survival all started with S-Plus.
> Without the existence of S-Plus we probably would be using
> xlisp-stat-nlme and have to deal with even more parentheses!  I'm sure
> Doug could enlighten us with some interesting stories about how nlme
> started as part of S-Plus.

The nlme package was always open source.  It was S-PLUS that was
closed source.  But we didn't develop nlme in S-PLUS, we developed it
in S from Bell Labs.  I rarely used S-PLUS at all.  Our book was
originally titled "Mixed-effects Models in S".  It was Springer who
lobbied for tacking the "and S-PLUS" on the end.

> From the perspective of developing personal skills that are portable
> to different platforms or careers or whatever, I wish I could use an
> open source mixed models package, but neither nlme nor lme4 nor
> MCMCglmm can fit models to large data sets with a variety of complex
> variance structures, so I use asreml.
>
> On a lighter note, I propose that the members of this list create the
> "Doug Bates foundation" and establish funding for Doug to quit his day
> job and spend his life finishing lme4.

Well, actually, Doug likes his day job (for the most part - about two
hours into a typical faculty meeting he may be open to offers of other
ways to spend his time).  This is the whole point of open-source
software development - it is part of my job.  For all intents and
purposes, the cost of open source software is the cost of the first
copy - creating and distributing subsequent copies is essentially free
relative to the cost of developing the first copy, especially when the
infrastructure for that distribution is available from other
open-source projects such as Apache, Linux, ...  My employer has no
problems with my spending my time developing that first copy.  It's
called "research" and they expect that I will spend at least some of
my time doing that. It happens that the University of Wisconsin has a
very strong tradition of openness with regard to research and freedom
of expression (look up the phrase "sifting and winnowing" on
www.wisc.edu) and is quite supportive of my making software freely
available.

From time to time it is suggested that it would speed development of R
if money were used to hire "professional programmers" as opposed to
the amateurs who work on it now.  But that really isn't the case.
R-Core is a meritocracy and its members have to believe that anyone
admitted to R-Core is really, really good and unlikely to botch things
up with careless commits.  It is unlikely that an ad for a programmer
on a job-search site is going to get you the next John Chambers or
Brian Ripley or Luke Tierney or ...  These are the top people in the
field and they aren't working on R as a job.  They are doing it
because of the freedom to create the software that they want to make
available, not what the marketing folks think should be done.  Do you
think that the marketing folks would have tolerated a system without a
GUI for this length of time?  If R was driven by marketing
considerations it would be Excel.

People are often taken aback when comparisons of software quality show
that well-established open-source projects have better-quality code
than commercial software.  For example, the evaluation of cdf's,
quantile functions and densities or probability functions in R is at
least as good as in commercial software and often much better.  Why?
Well, writing such code is a job for someone at SAS Institute.  For
Martin Maechler, writing code to evaluate these functions accurately
under the widest possible range of arguments is a passion.

The other part of the economics of open-source software versus
commercial software that escapes the usual analysis is that
development is only a small part of the cost of commercial software.
Most of the cost of doing business for software companies is in
marketing and support, and support for R is contributed through mailing
lists like this.

>
> Kevin Wright
>
>
> On Wed, Sep 23, 2009 at 11:31 AM, Douglas Bates <bates at stat.wisc.edu> wrote:
>> Got to disagree with you, Kevin.  admb and asreml are not part of R,
>> even in the general sense of R packages.  R is Open Source - they are
>> not. Tacking on an R interface to proprietary software and saying it
>> is available in R is misleading and dishonest.
>>
>> On Wed, Sep 23, 2009 at 8:54 AM, Kevin Wright <kw.stat at gmail.com> wrote:
>>> Paul,
>>>
>>> It appears to me that the published timings you reference are
>>> comparing the __nlme__ package with other software.  So the answer is
>>> yes, nlme really is that slow for some models.  You are probably aware
>>> that the __lme4__ package has faster algorithms.
>>>
>>> There are many ways to fit mixed models in R including nlme, lme4,
>>> MCMCglmm, admb, asreml, BUGS, etc.  If I was teaching a course, I would
>>> try to expose students to at least two of those in some detail and
>>> touch briefly on the others: nlme can fit a variety of complex
>>> variance structures, lme4 has faster algorithms, asreml is the only
>>> choice of animal/plant breeders and has commercial support, MCMCglmm
>>> has some Bayesian aspects and can fit some heteroskedastic variance
>>> structures, admb is used in Fish & Wildlife, etc.
>>>
>>> Mixed model fitting in R is definitely not a case of "one size fits all".
>>>
>>> Kevin Wright
>>>
>>>
>>> On Wed, Sep 23, 2009 at 1:36 AM, Paul Johnson <pauljohn32 at gmail.com> wrote:
>>>> Sent this to r-sig-debian by mistake the first time.  Depressing.
>>>>
>>>> 1.  One general question for general discussion:
>>>>
>>>> Is HLM6 faster than lmer? If so, why?
>>>>
>>>> I'm always advocating R to students, but some faculty members are
>>>> skeptical.  A colleague compared the commercial HLM6 software to lmer.
>>>> HLM6 seems to fit the model in 1 second, but lmer takes 60 seconds.
>>>>
>>>> If you have HLM6 (I don't), can you tell me if you see similar differences?
>>>>
>>>> My first thought was that HLM6 uses PQL by default, and it would be
>>>> faster.  However, in the output, HLM6 says:
>>>>
>>>> Method of estimation: restricted maximum likelihood
>>>>
>>>> But that doesn't tell me what quadrature approach they use, does it?
>>>>
>>>> Another explanation for the difference in time might be the way HLM6
>>>> saves the results of some matrix calculations and re-uses them behind
>>>> the scenes.  If every call to lmer is re-calculating some big matrix
>>>> results, I suppose that could explain it.
>>>>
>>>> There are comparisons from 2006 here
>>>>
>>>> http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/tables.shtml
>>>>
>>>> that indicate that lme was much slower than HLM, but that doesn't help
>>>> me understand *why* there is a difference.
>>>>
>>>> 2. What does "stack imbalance in .Call" mean in lmer?
>>>>
>>>> Here's why I ask.  Searching for comparisons of lmer and HLM,  I went
>>>> to CRAN &  I checked this document:
>>>>
>>>> http://cran.r-project.org/web/packages/mlmRev/vignettes/MlmSoftRev.pdf
>>>>
>>>> I *think* these things are automatically generated.  The version
>>>> that's up there at this moment  (mlmRev edition 0.99875-1)  has pages
>>>> full of the error message:
>>>>
>>>> stack imbalance in .Call,
>>>>
>>>> Were those always there?  I don't think so.   What do they mean?
>>>>
>>>> 3. In the HLM6 output, there is a message at the end of the variable list:
>>>>
>>>> '%' - This level-1 predictor has been centered around its grand mean.
>>>> '$' - This level-2 predictor has been centered around its grand mean.
>>>>
>>>> What effect does that have on the estimates?  I believe it should have
>>>> no effect on the fixed effect slope estimates, but it seems to me the
>>>> estimates of the variances of random parameters would be
>>>> changed.  In order to make the estimates from lmer as directly
>>>> comparable as possible, should I manually center all of the variables
>>>> before fitting the model?   I'm a little stumped on how to center a
>>>> multi-category factor before feeding it to lmer.  Know what I mean?
>>>>
>>>> pj
>>>>
>>>> --
>>>> Paul E. Johnson
>>>> Professor, Political Science
>>>> 1541 Lilac Lane, Room 504
>>>> University of Kansas
>>>>
>>>> _______________________________________________
>>>> R-sig-mixed-models at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>>
>>>
>>
>
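
On Paul's third question (grand-mean centering, and how to center a
multi-category factor), here is a minimal sketch in R.  The data are
simulated and the names score, ses, sector and school are hypothetical,
not taken from the HLM6 example.  It centers a numeric predictor, centers
a factor by expanding it to dummy columns first, and checks that
centering leaves a fixed-effect slope unchanged in an ordinary lm() fit.
The lmer() comparison at the end runs only if lme4 is installed.

```r
# Simulated toy data; every name here is illustrative only.
set.seed(1)
n_school <- 20
n_per <- 25
school <- factor(rep(seq_len(n_school), each = n_per))
u0 <- rnorm(n_school, sd = 2)     # school-level intercept deviations
u1 <- rnorm(n_school, sd = 0.5)   # school-level slope deviations
ses <- rnorm(n_school * n_per, mean = 3)
sector <- factor(sample(c("public", "private", "charter"),
                        n_school * n_per, replace = TRUE))
idx <- as.integer(school)
score <- 50 + u0[idx] + (2 + u1[idx]) * ses + rnorm(n_school * n_per)
d <- data.frame(school, ses, sector, score)

# Grand-mean center a numeric predictor.
d$ses_c <- d$ses - mean(d$ses)

# "Centering" a factor: expand it to treatment-contrast dummy columns
# (dropping the intercept column), then subtract each column's mean.
dummies <- model.matrix(~ sector, data = d)[, -1, drop = FALSE]
dummies_c <- scale(dummies, center = TRUE, scale = FALSE)

# In a fixed-effects-only fit, centering shifts the intercept
# but leaves the slope unchanged.
fit_raw <- lm(score ~ ses, data = d)
fit_cen <- lm(score ~ ses_c, data = d)
slope_diff <- unname(coef(fit_raw)["ses"] - coef(fit_cen)["ses_c"])

# Optional: compare variance components with and without centering.
if (requireNamespace("lme4", quietly = TRUE)) {
  m_raw <- lme4::lmer(score ~ ses + (1 + ses | school), data = d)
  m_cen <- lme4::lmer(score ~ ses_c + (1 + ses_c | school), data = d)
  print(lme4::VarCorr(m_raw))
  print(lme4::VarCorr(m_cen))
}
```

In a random-slope model like the one above, the intercept variance
describes the spread of group lines at predictor value zero, so it and
the intercept-slope correlation would be expected to differ between the
two fits, while the slope variance and residual variance should agree up
to optimizer tolerance.  Centering before calling lmer therefore seems
the simplest way to make the variance estimates comparable with output
from a program that centers by default.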