[R-sig-ME] large dataset

Wed Jan 31 01:25:58 CET 2007

On 1/30/07, Dan Pemstein <dbp at uiuc.edu> wrote:

> I'm attempting to fit a crossed random effects model to a rather large
> data set.  This is EU parliament voting data (the response variable is
> binary) from 574 legislators over 2123 votes.  EU parliamentarians
> miss a lot of votes so there are ~700,000 total observations.  The
> model also includes quite a few covariates---on the order of 30-50
> (mostly fixed effects for country, party, etc), depending on the
> particular specification.  I'm having some serious issues fitting a
> crossed effects logit model to this data with lme4 without exhausting
> system memory.  I have a quad-core intel linux machine with 8 gigs of
> ram and a lot of swap to play with, but I'm still falling short.
> Interestingly, I've successfully fit this model using HLM6 on a
> machine with substantially less RAM.

> My question is largely about feasibility.  I would like to use lme4 to
> analyze this dataset because it provides a much better set of features
> for checking model fit and generating predictions than HLM (one can't
> even get the fixed effects variance-covariance matrix out of HLM6's
> crossed effects routine).  Is this impossible?  Are there any ways to
> reduce lmer's memory footprint that I might try?  Would one expect a
> cross-classified logit model with 700,000 observations to require
> upwards of 12 gigs of memory or have I uncovered a small memory leak
> that isn't visible with smaller datasets?  The memory use creeps up
> slowly over the course of a run which is at least consistent with a
> memory leak, but, not knowing anything about the implementation, I'm
> just speculating wildly here.  Obviously, I could sub-sample, but this
> is already a sample of a larger dataset, so I'm loath to do that if I
> can avoid it.

Could you try to fit the response with a linear mixed model using the
lmer2 function that is in versions 0.9975-11 and later of the lme4
package?  I know the model is inappropriate but I just want to get a
handle on whether the mer2 representation saves enough storage to make
working with such a data set and model feasible.

I shouldn't speculate without actually examining the model fit myself
but I think the memory hog may be the fixed-effects model matrix.
Currently that model matrix must be created as a dense matrix and it
must be created using all the rows.  If you have 30-50 covariates,
and some of them are factors (each of which expands into several
indicator columns), then that matrix could be the one that is breaking
the bank.  In lmer2 the
fixed-effects model matrix is stored as a sparse matrix (although it
is initially created as a dense matrix).  The random-effects model
matrix is created as a sparse matrix and it usually isn't the problem
with memory usage.
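
To put rough numbers on the dense-versus-sparse point, here is a minimal
sketch. The 700,000 rows and 50 columns come from the original post; the
small `country` factor below is made-up data, used only to compare the two
storage schemes with the Matrix package. A single dense copy at this size is
about 280 million bytes, but R may hold several temporary copies while
building the model frame and matrix, so the peak can be a multiple of that.

```r
library(Matrix)  # recommended package shipped with R

# Back-of-the-envelope size of one dense, double-precision model matrix.
n_obs  <- 700000  # approximate number of observations in the original post
n_cols <- 50      # upper end of "30-50 covariates"; factors can add more columns
dense_bytes <- n_obs * n_cols * 8  # 8 bytes per double
print(dense_bytes / 2^20)          # size of one copy in MiB

# Illustration on made-up data: indicator columns for a 25-level factor.
set.seed(1)
d <- data.frame(country = factor(sample(letters[1:25], 10000, replace = TRUE)))
X_dense  <- model.matrix(~ country, d)         # base R, dense storage
X_sparse <- sparse.model.matrix(~ country, d)  # Matrix, stores only nonzeros
print(object.size(X_dense))
print(object.size(X_sparse))
```

Indicator columns are mostly zeros, which is why the sparse version is much
smaller; continuous covariates would not compress this way.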

If you do succeed in fitting a linear mixed model to these data using
lmer2 I would be interested in the sizes of some of the slots in the
fitted model.  I enclose a short transcript showing one way of
checking these sizes on an S4 object.

Regarding the possibility of a memory leak - I wouldn't be shocked if
I had managed to create a memory leak but the behavior that you
mention is also consistent with ordinary garbage collection.  At present the
optimization of the deviance for generalized linear mixed models goes
through the nlminb function in R which means that the deviance
evaluation must be an R function.  Thus there are R objects created
within the optimization that must be garbage collected.  I think I
know a way around this and it is on my "To Do" list to check it out
but that list is pretty long these days so I can't promise anything.
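
As a rough illustration of why memory can creep up between collections, the
sketch below passes an R function to nlminb. The objective `toy_deviance` is
a hypothetical stand-in, not the deviance evaluation used in lme4; the only
point is that each call allocates temporary R objects that stay in memory
until the garbage collector runs.

```r
# Hypothetical objective standing in for a deviance evaluation.
n_calls <- 0
toy_deviance <- function(theta) {
  n_calls <<- n_calls + 1
  tmp <- numeric(1e5)  # temporary vector allocated on every evaluation
  sum((theta - c(1, 2))^2) + sum(tmp)
}

# nlminb calls the R function repeatedly while searching for the minimum.
fit <- nlminb(start = c(0, 0), objective = toy_deviance)
print(fit$par)  # should be close to c(1, 2)
print(n_calls)  # number of evaluations, each of which created a temporary

# Temporaries are reclaimed only when the collector runs; gc() forces a
# collection and reports current memory use.
print(gc())
```

If memory reported by gc() after a forced collection keeps growing from run
to run, that would point more toward a leak than toward uncollected garbage.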

Thanks for writing to the list.  I'll be interested in whether it is
possible to work with such large data sets effectively.
-------------- next part --------------
> fm <- lmer2(math~sx*eth+gr+cltype+(yrs|id)+(1|tch)+(yrs|sch), star)
> nms <- slotNames(fm)
> names(nms) <- nms
> object.size(fm)
[1] 16820760
> sort(sapply(nms, function(nm) object.size(slot(fm, nm))))
      Gp    fixef       nc deviance     dims       ST   cnames     call 
      72      176      384      528      560     1072     1928     3744 
   terms    ranef  weights   offset    flist    frame     ZXyt        A 
    6168   184024   196664   196664   978952  3191264  3547072  3594864 
       L 
 4914248
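
The transcript above needs lmer2 and the star data, which may not be
available in every installation. As a sketch of the same idiom on something
self-contained, the example below applies it to a toy S4 class; the class
`ToyFit` and its slots are hypothetical and exist only for illustration.

```r
# Hypothetical S4 class with slots of very different sizes.
setClass("ToyFit",
         representation(coef = "numeric", frame = "data.frame", big = "matrix"))
obj <- new("ToyFit",
           coef  = c(a = 1, b = 2),
           frame = data.frame(x = 1:100),
           big   = matrix(0, 1000, 100))

# Same pattern as in the transcript: name the slot names, then size each slot.
nms <- slotNames(obj)
names(nms) <- nms
sizes <- sort(sapply(nms, function(nm) as.numeric(object.size(slot(obj, nm)))))
print(sizes)             # bytes per slot, smallest first
print(object.size(obj))  # total size of the object
```

Sorting the per-slot sizes makes it easy to see which component dominates the
total.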