[R-sig-ME] large dataset

Douglas Bates bates at stat.wisc.edu
Wed Jan 31 01:38:09 CET 2007


I should have mentioned this in my earlier reply - please use version
0.9975-12 of lme4 when checking with lmer2.  I just uploaded this
version to CRAN and it should appear on the main site and the mirrors
in a day or two.  You can get it now from the SVN archive

https://svn.r-project.org/R-packages/trunk/lme4
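
Once it is installed, you can check that you are running the new
version with something along these lines:

  library(lme4)
  packageDescription("lme4")$Version   # should report "0.9975-12"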

In an earlier thread on this list Andrew Robinson described how he was
unable to run even the simplest examples of lmer2 on a FreeBSD system.
With his help we finally tracked down the dumb error that I had made
in the C function mer2_getPars and fixed it for the -12 release.
Under Linux the bug did not cause a memory error, but it certainly
used much more memory than necessary during the iterations.


On 1/30/07, Douglas Bates <bates at stat.wisc.edu> wrote:
> On 1/30/07, Dan Pemstein <dbp at uiuc.edu> wrote:
>
> > I'm attempting to fit a crossed random effects model to a rather large
> > data set.  This is EU parliament voting data (the response variable is
> > binary) from 574 legislators over 2123 votes.  EU parliamentarians
> > miss a lot of votes, so there are ~700,000 total observations.  The
> > model also includes quite a few covariates---on the order of 30-50
> > (mostly fixed effects for country, party, etc.), depending on the
> > particular specification.  I'm having some serious issues fitting a
> > crossed effects logit model to this data with lme4 without exhausting
> > system memory.  I have a quad-core Intel Linux machine with 8 gigs of
> > RAM and a lot of swap to play with, but I'm still falling short.
> > Interestingly, I've successfully fit this model using HLM6 on a
> > machine with substantially less RAM.
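> >
> > For concreteness, the kind of call I am after has the form below; the
> > data frame and column names are just placeholders for my real ones.
> >
> >   ## crossed random intercepts for legislator and for vote;
> >   ## 'eu_votes', 'yea', 'country', 'party' are illustrative names
> >   lmer(yea ~ country + party + (1 | legislator) + (1 | vote),
> >        data = eu_votes, family = binomial)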
>
> > My question is largely about feasibility.  I would like to use lme4 to
> > analyze this dataset because it provides a much better set of features
> > for checking model fit and generating predictions than HLM (one can't
> > even get the fixed effects variance-covariance matrix out of HLM6's
> > crossed effects routine).  Is this impossible?  Are there any ways to
> > reduce lmer's memory footprint that I might try?  Would one expect a
> > cross-classified logit model with 700,000 observations to require
> > upwards of 12 gigs of memory, or have I uncovered a small memory leak
> > that isn't visible with smaller datasets?  The memory use creeps up
> > slowly over the course of a run, which is at least consistent with a
> > memory leak, but, not knowing anything about the implementation, I'm
> > just speculating wildly here.  Obviously, I could sub-sample, but this
> > is already a sample of a larger dataset, so I'm loath to do that if I
> > can avoid it.
>
> Could you try to fit the response with a linear mixed model using the
> lmer2 function that is in versions 0.9975-11 and later of the lme4
> package?  I know the model is inappropriate, but I just want to get a
> handle on whether the mer2 representation saves enough storage to make
> working with such a data set and model feasible.
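>
> Something along these lines is what I have in mind (the data frame
> and variable names below are placeholders for yours):
>
>   ## linear mixed model for the binary response, just to gauge storage
>   fm1 <- lmer2(yea ~ country + party + (1 | legislator) + (1 | vote),
>                data = eu_votes)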
>
> I shouldn't speculate without actually examining the model fit myself
> but I think the memory hog may be the fixed-effects model matrix.
> Currently that model matrix must be created as a dense matrix and it
> must be created using all the rows.  When you say that you have 30-50
> covariates (and I assume that some of them may be factors), that
> matrix could be the one that is breaking the bank.  In lmer2 the
> fixed-effects model matrix is stored as a sparse matrix (although it
> is initially created as a dense matrix).  The random-effects model
> matrix is created as a sparse matrix and it usually isn't the problem
> with memory usage.
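>
> A rough calculation shows why.  With 700,000 rows and, say, 100
> columns once the factors are expanded (that column count is only a
> guess), a single dense copy of the fixed-effects model matrix is
>
>   700000 * 100 * 8 / 2^20   # doubles, in megabytes: about 534
>
> and holding several copies of a matrix that size during the
> iterations adds up quickly.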
>
> If you do succeed in fitting a linear mixed model to these data using
> lmer2 I would be interested in the sizes of some of the slots in the
> fitted model.  I enclose a short transcript showing one way of
> checking these sizes on an S4 object.
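>
> In outline, one such check looks like
>
>   ## size, in bytes, of each slot of a fitted S4 object such as 'fm1'
>   sapply(slotNames(fm1), function(nm) object.size(slot(fm1, nm)))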
>
> Regarding the possibility of a memory leak - I wouldn't be shocked if
> I had managed to create a memory leak, but the behavior that you
> mention is consistent with garbage collection.  At present the
> optimization of the deviance for generalized linear mixed models goes
> through the nlminb function in R, which means that the deviance
> evaluation must be an R function.  Thus there are R objects created
> within the optimization that must be garbage collected.  I think I
> know a way around this, and it is on my "To Do" list to check it out,
> but that list is pretty long these days, so I can't promise anything.
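>
> In the meantime you can get some idea of what the collector is doing
> during a fit with
>
>   gcinfo(TRUE)   # report each garbage collection as it happens
>   gc()           # trigger a collection and print current memory use
>
> which will at least show how much of the creeping usage is
> reclaimable.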
>
> Thanks for writing to the list.  I'll be interested in whether it is
> possible to work with such large data sets effectively.