[R-sig-ME] New version of lme4 on R-forge

Douglas Bates <bates@stat.wisc.edu>
Mon Mar 31 23:58:29 CEST 2008


I just committed files for lme4 version 0.999375-11 to the R-forge
source code repository.  If the package builds tonight are successful
then

install.packages("lme4", repos = "http://r-forge.r-project.org/")

should pick up the new version tomorrow.

The major change in this version is formalizing the distinction
between random-effects terms and grouping factors for the random
effects.  Variance component parameters, specifically the parameters
represented in the list of matrices in the ST slot, are associated
with random-effects terms in the model formula.  So, using our
favorite examples,

fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)

has one random-effects term, while

fm2 <- lmer(Reaction ~ Days + (1|Subject) + (0+Days|Subject), sleepstudy)

has two independent random-effects terms.  However, there is only one
grouping factor, "Subject", in both fm1 and fm2.

In previous versions of the lme4 package the slot fm2@flist would be a
data frame of two columns, both of which were Subject.  Now it has
just one column with an "assign" attribute that maps terms (in ST) to
factors (in flist).
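
For instance (a sketch based on the description above; the exact
printed output may differ):

library(lme4)
fm2 <- lmer(Reaction ~ Days + (1|Subject) + (0+Days|Subject), sleepstudy)
str(fm2@flist)               # a single column, Subject
attr(fm2@flist, "assign")    # maps the two terms in ST to that factor,
                             # presumably c(1, 1)
length(fm2@ST)               # two terms, hence two matrices in ST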

This allows the value of ranef(fm2) to have the same form as that of
ranef(fm1).  In both cases it is a named list of matrices, one for
each grouping factor.  Indexing the value of ranef() by grouping
factors, not by terms, eliminates some of the peculiarities of the
random effects in models like fm2.
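
Schematically (again a sketch of the intended behavior, not verbatim
output):

ranef(fm1)   # named list with one component, "Subject": an 18 by 2
             # matrix with columns (Intercept) and Days
ranef(fm2)   # now also a named list with a single "Subject" component,
             # combining the columns from the two terms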

To get this committed I needed to disable the effect of "postVar =
TRUE" in the ranef method.  I will reinstate that option after I work
out the details of the code.

There is also a new vignette in this version.  Like many of my
write-ups, these notes (Notes.pdf) are incomplete but may be of
interest to some who are familiar with a generalized least squares
(GLS) representation of mixed models (such as that used in MLwiN and
HLM).  The lme4 package uses a penalized least squares (PLS)
representation instead.  A PLS representation always has a
corresponding GLS form but, especially for models with crossed or
partially crossed grouping factors, the PLS representation is much
more concise and easily managed than the GLS representation.
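
To sketch the correspondence (in my notation here, which may not
match the notes exactly): for the linear mixed model

$\bm{y} = \bm{X}\bm{\beta} + \bm{Z}\bm{b} + \bm{\epsilon}$, with
$\bm{b} \sim \mathcal{N}(\bm{0}, \sigma^2\bm{\Lambda}\bm{\Lambda}')$ and
$\bm{\epsilon} \sim \mathcal{N}(\bm{0}, \sigma^2\bm{I}_n)$,

the GLS representation works with the marginal $n \times n$ covariance

$\bm{\Sigma} = \sigma^2(\bm{Z}\bm{\Lambda}\bm{\Lambda}'\bm{Z}' + \bm{I}_n)$,

whereas the PLS representation minimizes the penalized criterion

$\|\bm{y} - \bm{X}\bm{\beta} - \bm{Z}\bm{\Lambda}\bm{u}\|^2 + \|\bm{u}\|^2$

over $\bm{\beta}$ and $\bm{u}$, which requires only the $q \times q$
sparse Cholesky factor $\bm{L}$ with
$\bm{L}\bm{L}' = \bm{\Lambda}'\bm{Z}'\bm{Z}\bm{\Lambda} + \bm{I}_q$.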

I wrote these notes to describe to Andrew Robinson how to rewrite a
couple of expressions in the 1997 Biometrics paper by Kenward and
Roger.  This is the paper that introduced approximate distributions
for certain t and F statistics in mixed models, approximations that
many would like to see implemented in lme4.  As is typical in such
papers, there are several examples included, but all of them use data
sets of modest size (not surprisingly, since such approximations
matter much less in models fit to large data sets).  Many people
appear perplexed as to why I don't get off my butt and implement
these approximations.  Explanations of laziness and/or incompetence
suggest themselves.  Before opting for these explanations, however, I
would ask you to look at the formulas in that paper.  The important
matrices, written $\bm{\Sigma}$, $\bm{P}_i$ and $\bm{Q}_{ij}$ (where
$i$ and $j$ index the variance component parameters), are all $n$ by
$n$ symmetric matrices ($n$ is the number of observations).  I have
fit models where $n$ is in the millions.  Working with $n$ by $n$
matrices when $n$ is in the millions is, shall we say, problematic.

You might be able to evaluate expressions involving matrices like
these if you could count on their being very, very sparse.
Unfortunately, if the grouping factors do not form a nested sequence
the matrix $\bm{\Sigma}$ is not very sparse.  Even worse, $\bm{P}_i$
and $\bm{Q}_{ij}$ are defined in terms of $\bm{\Sigma}^{-1}$, which
would likely end up being dense.  If my math is correct, it would
require several terabytes of memory to store a dense, symmetric $n$
by $n$ matrix when $n$ is in the millions, and you would apparently
need several such matrices.  It will take a few more iterations of
Moore's Law before we can compute such quantities using the equations
in that form.  While we are waiting for computers with several
terabytes of memory we may want to see if those formulas can be
rewritten in a more computable form.
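
The arithmetic is easy to check.  In R, with $n = 10^6$ and 8-byte
doubles,

n <- 1e6
8 * n^2 / 2^40                # about 7.3 terabytes for a dense n by n matrix
8 * n * (n + 1) / 2 / 2^40    # still about 3.6 terabytes for one triangle

so even one such matrix is out of reach on current hardware.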

The PLS representation has the advantage, relative to GLS, of being
based on much smaller matrices.  In particular, the Cholesky factor L
is the key matrix in PLS.  L is comparatively small (q by q, where q
is the total number of random effects) and it can be kept sparse, even
for models with crossed grouping factors.
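
For fm1 above, a quick check of the size (a sketch; it assumes the
mer object stores the factor in the L slot as a CHMfactor and that
the Matrix package coercion to a sparse matrix applies):

L <- as(fm1@L, "sparseMatrix")   # sparse Cholesky factor from the fit
dim(L)    # 36 by 36 (q = 18 subjects x 2 random effects per subject),
          # compared with n = 180 observations in sleepstudy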



