[R-sig-ME] large dataset - lmer2 vector size specified is too large

Josh Wills josh.wills at gmail.com
Fri Mar 2 18:39:10 CET 2007


I can tell you the general approach I took when I worked on it a
couple of years ago.  Admittedly, my application was relatively
simple, and I'm not sure that my implementation supported all of the
possible formula expressions, but I did my best.

I started with the SparseM package, which defines a matrix.csc class
(column-oriented storage) made up of three vectors: one of length V
("values") containing the actual non-zero values of the matrix,
another of length V ("indices") containing the row indices of those
non-zero values, and a third whose length equals the number of columns
in the original (non-sparse) matrix plus 1 (IIRC), specifying the
offsets at which the different columns start in the indices vector.
This is a pretty common sparse matrix format; I can give an example if
it helps clarify the notion.
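To make the storage scheme concrete, here is a small base-R sketch of
that column-oriented format.  It builds the three vectors by hand for
a toy matrix and then reconstructs the dense matrix from them (the
SparseM class and its slot names are deliberately left out; this just
illustrates the three vectors):

```r
## a 4 x 3 matrix with 5 non-zero entries
m <- matrix(0, 4, 3)
m[1, 1] <- 10; m[3, 1] <- 20; m[2, 2] <- 30; m[1, 3] <- 40; m[4, 3] <- 50

values  <- c(10, 20, 30, 40, 50)  # "values": the V non-zero entries, by column
row_ind <- c(1, 3, 2, 1, 4)       # "indices": row index of each non-zero
col_ptr <- c(1, 3, 4, 6)          # ncol + 1 offsets: where each column starts

## reconstruct the dense matrix from the three vectors
rebuilt <- matrix(0, 4, 3)
for (j in 1:3) {
  idx <- col_ptr[j]:(col_ptr[j + 1] - 1)  # positions belonging to column j
  rebuilt[row_ind[idx], j] <- values[idx]
}
stopifnot(identical(rebuilt, m))
```

Column j occupies positions col_ptr[j] through col_ptr[j + 1] - 1 of
the values and indices vectors, which is why the offset vector needs
that extra trailing entry.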

My general problem was one where I needed to solve a logistic
regression with millions of rows and one numerical variable (usually
representing a price) and a number of categorical variables (stored as
factors) that could potentially have thousands of distinct level
values.  The real challenge came when model.matrix would blow out the
factors into sets of 0/1 columns, the vast majority of which were
equal to zero.  Figuring out a way to get the factors directly into
the (sparse) representation of the model.matrix without blowing them
out into 0s and 1s first was the real trick, and I used two R
functions to do it: tapply, to get the counts for the different levels
of the factor (so I knew how many non-zeros each level's 0/1 column
would contain), and order, to get back the indices that would sort the
factor.  At
that point (at least for a simple example) I was essentially done--
the order command gave me back the indices of the rows that would need
to be non-zero for the 0/1 column corresponding to each level of the
factor, and the tapply gave me the counts so I would know how many
non-zeros were in each column.  Assuming an intercept term was
present, I ignored the first level of the factor, and inserted the
indices from order and the counts from tapply into the sparse matrix
structure for the rest.  That was essentially it: the running time was
O(n log n) due to the order call, but I only used storage space
proportional to roughly twice the size of the original data frame
rather than thousands of times larger, and once I could hand the
finished sparse model.matrix structure to the Cholesky algorithm in
SparseM, it ran very quickly.
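The core of that trick can be sketched in a few lines of R.  This toy
version handles a single factor, keeps all levels, and ignores the
intercept adjustment; it just shows how order and the per-level counts
yield the three CSC vectors directly, which I then check against the
dense 0/1 expansion:

```r
set.seed(1)
f <- factor(sample(letters[1:3], 10, replace = TRUE))
n <- length(f)

## counts per level: how many non-zeros each 0/1 column will hold
counts <- tabulate(f)
## order(f) groups the row indices by level: level 1's rows first, etc.
ord <- order(f)

## the three CSC vectors for the full indicator matrix (one column per level)
values  <- rep(1, n)
row_ind <- ord
col_ptr <- cumsum(c(1, counts))

## check each sparse column against the dense 0/1 expansion
dense <- sapply(levels(f), function(l) as.integer(f == l))
for (j in seq_along(levels(f))) {
  rows <- row_ind[col_ptr[j]:(col_ptr[j + 1] - 1)]
  stopifnot(identical(sort(rows), which(dense[, j] == 1)))
}
```

No n-by-nlevels dense matrix is ever materialized: the storage is just
the n values, the n row indices, and the nlevels + 1 offsets.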

Now there are a number of simplifications here.  First, I called
model.frame and used the structure it created to parse the formulas,
and while I did incorporate features to handle interaction terms
(between factors, or between numerics and factors) and the case where
the formula omitted the intercept, I'm sure my implementation was not
robust enough for every expression that can be made with formulas --
but that was why I wanted to work with you guys in the first place.
Still, I wrote it entirely in R (leveraging the Fortran
implementations from SparseM), and it generated the correct results on
the test sets I compared it against (generally ones that regular glm
could handle).

Best,
Josh

On 3/2/07, Douglas Bates <bates at stat.wisc.edu> wrote:
> On 3/2/07, Gregor Gorjanc <gregor.gorjanc at bfro.uni-lj.si> wrote:
> > Douglas Bates <bates at ...> writes:
> > ...
> > > but even that formula means that you are estimating 448 fixed effects
> > > for which the model matrix is of size 448 * 43680 * 8 bytes (about 150
> > > MB).  In addition you are attempting to estimate
> >
> > Not directly related, but: is it really necessary to build the model matrix, i.e. X?
>
> Necessary? For linear mixed models no - for generalized linear mixed
> models yes (think of how the IRLS algorithm works).
>
> Easy to change?  Regrettably, no.
>
> > Building X'X directly would be more efficient, i.e. only 448 * 448 * 8 bytes.
>
> That wouldn't be sufficient.  As described in the "Implementation"
> vignette in the lme4 package, the calculations in lmer2 are based on
> the matrix A which is [Z X -y]'[Z X -y].  You can't use just X'X - you
> need the crossproducts of all the columns of Z, X and y.
>
> The matrix A is potentially huge but sparse.  Assuming that Florian
> really does want to try to estimate a model with (affyID|cephID) then
> Z has 224 * 195 = 43680 columns.
>
> In lmer Z'Z is stored as a sparse matrix.  In lmer2 A is stored as a
> sparse matrix.
>
> Generating A as a sparse matrix directly from the data or directly
> from a model.frame object would result in considerable savings in
> memory usage but doing so is far from trivial.  Most useRs are not
> aware (nor need they be aware) of the complexity of the operations
> involved in taking a linear model formula and a data specification
> (and all the arguments like subset, na.action, offset, weights, ...
> that are expected to be available in model-fitting functions) and
> creating first a model.frame then a model.matrix.  It is one of those
> things that "just works" but is incredibly complicated beneath the
> surface.
>
> Even the first stage (producing a model frame) is complicated - to the
> extent that Thomas Lumley has a report on "The Standard Non-standard
> Evaluation in R" dealing with how arguments are evaluated in the
> model.frame function.
>
> There are various approaches that could be taken to building A as a
> sparse matrix directly from the model specification and a model frame.
>  Probably the easiest to implement would be to use the existing
> model.matrix code to produce horizontal sections of [Z X -y] and
> accumulate A from these sections.  This is still far from trivial
> because you either need two passes (one to determine the positions of
> the nonzeros and one to evaluate them) or you have to somehow update
> the structure of A on the fly.  The other thing that gets tricky is
> that the columns of X may change from section to section because not
> all levels of a factor may be used in a given section.  It would be
> necessary to ensure that the columns from all sections are aligned.
>
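The section-wise accumulation Douglas describes at the end can be
sketched quickly.  This toy version is dense purely for illustration
and skips the hard parts he points out (keeping A sparse, the two-pass
structure determination, and aligning factor columns across sections);
it only shows that A = [Z X -y]'[Z X -y] can be accumulated from
horizontal sections of the combined matrix:

```r
set.seed(42)
n <- 100
Z <- matrix(rnorm(n * 4), n)       # stand-in for the random-effects columns
X <- cbind(1, rnorm(n))            # stand-in for the fixed-effects columns
y <- rnorm(n)
M <- cbind(Z, X, -y)               # the combined matrix [Z X -y]

A <- crossprod(M)                  # full-data reference: [Z X -y]'[Z X -y]

## accumulate A from horizontal sections of 25 rows each
A_acc <- matrix(0, ncol(M), ncol(M))
for (rows in split(seq_len(n), ceiling(seq_len(n) / 25))) {
  A_acc <- A_acc + crossprod(M[rows, , drop = FALSE])
}
stopifnot(isTRUE(all.equal(A, A_acc)))
```

Because the crossproduct is a sum over rows, each section contributes
its own crossproduct independently, so only one section of [Z X -y]
ever needs to exist at a time.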

More information about the R-sig-mixed-models mailing list