[R-sig-ME] large dataset
Douglas Bates
bates at stat.wisc.edu
Wed Jan 31 23:02:59 CET 2007
On 1/31/07, Dan Pemstein <dbp at uiuc.edu> wrote:
> Thanks for replying so quickly.
>
> It may take me a couple of days to get the new version of lme4
> installed and to run the tests you're interested in. In the
> meantime, I ran (using lmer in 0.9975-11):
>
> A fixed intercept + crossed random intercepts only model
> - Ran out of memory + swap.
> A votes intercept only model with all the covariates
> - Completed. Topped out at around 6 gigs of memory; this peak
> occurred at the end of the run, after the verbose iteration output
> had completed. I'm not sure if my earlier full-model runs crashed
> at this point as well, but it is a distinct possibility.
>
> Both these runs were fit using PQL. One of my full runs used Laplace
> and ran out of memory after 20-odd iterations and 12+ hours of
> processor time.
>
> Here are the sizes for the single random intercept model:
>
> > nms<-slotNames(mod1)
> > object.size(mod1)
> [1] 748925256
> > sort(sapply(nms, function(nm) object.size(slot(mod1, nm))))
>        Gp        nc  deviance    status  gradComp   devComp       Xty       rXy
>        48       256       320       384       496       728       960       960
>     fixef     Omega      call    cnames       Zty       rZy     ranef     terms
>       960      3552      6528      7776     17032     17032     17032     17096
>      bVar    family       ZtZ         L       XtX       RXX       ZtX       RZX
>     17456     31448     35616     69880    121824    121880   1962544   1962544
>    RZXinv     flist         y       wts    wrkres        Zt   weights     frame
>   1962544   2602096   4965032   4965032   4965032   9931408  39720128  69657216
>         X
> 605738248
>
> I'll post the results of an lmer2 run to the list once I have a
> chance.
Thanks. That result by itself can tell us where the problem lies.
Notice that the size of the X slot is about 600 MB out of the total of
about 750 MB. By comparison, the other slots, like XtX, ZtZ, and ZtX,
are much smaller.
If you still have that model fit available, could you check
library(Matrix)
object.size(as(mod1@X, "sparseMatrix"))
The good news from this example is that it gives us hope for fitting
mixed models to large data sets. The bad news is that doing so will
require a considerable amount of development.
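
As a point of reference (a simulated illustration, not the EP voting data
or code from this thread), a model matrix built mostly from factor dummies
is almost entirely zeros, so its sparse form can be far smaller than the
dense one:

library(Matrix)

set.seed(1)
n   <- 100000
f1  <- factor(sample(LETTERS[1:20], n, replace = TRUE))
f2  <- factor(sample(letters[1:15], n, replace = TRUE))
X   <- model.matrix(~ f1 + f2)      # dense: n x 34 doubles, zeros included
Xsp <- as(X, "sparseMatrix")        # same values, compressed column storage

object.size(X)                      # every cell stored
object.size(Xsp)                    # only the nonzero entries stored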
> On Tue, Jan 30, 2007 at 06:38:09PM -0600, Douglas Bates wrote:
> > I should have mentioned this in my earlier reply - please use version
> > 0.9975-12 of lme4 when checking with lmer2. I just uploaded this
> > version to CRAN and it should appear on the main site and the mirrors
> > in a day or two. You can get it now from the SVN archive
> >
> > https://svn.r-project.org/R-packages/trunk/lme4
> >
> > In an earlier thread on this list Andrew Robinson described how he was
> > unable to run even the simplest examples of lmer2 on a FreeBSD system.
> > With his help we finally tracked down the dumb error that I had made
> > in the C function mer2_getPars and fixed it for the -12 release.
> > Under Linux the bug was not causing a memory error but it certainly
> > would use up much more memory than necessary during the iterations.
> >
> >
> > On 1/30/07, Douglas Bates <bates at stat.wisc.edu> wrote:
> > >On 1/30/07, Dan Pemstein <dbp at uiuc.edu> wrote:
> > >
> > >> I'm attempting to fit a crossed random effects model to a rather large
> > >> data set. This is EU parliament voting data (the response variable is
> > >> binary) from 574 legislators over 2123 votes. EU parliamentarians
> > >> miss a lot of votes so there are ~700,000 total observations. The
> > >> model also includes quite a few covariates---on the order of 30-50
> > >> (mostly fixed effects for country, party, etc.), depending on the
> > >> particular specification. I'm having some serious issues fitting a
> > >> crossed effects logit model to this data with lme4 without exhausting
> > >> system memory. I have a quad-core Intel Linux machine with 8 gigs of
> > >> RAM and a lot of swap to play with, but I'm still falling short.
> > >> Interestingly, I've successfully fit this model using HLM6 on a
> > >> machine with substantially less RAM.
> > >
> > >> My question is largely about feasibility. I would like to use lme4 to
> > >> analyze this dataset because it provides a much better set of features
> > >> for checking model fit and generating predictions than HLM (one can't
> > >> even get the fixed effects variance-covariance matrix out of HLM6's
> > >> crossed effects routine). Is this impossible? Are there any ways to
> > >> reduce lmer's memory footprint that I might try? Would one expect a
> > >> cross-classified logit model with 700,000 observations to require
> > >> upwards of 12 gigs of memory or have I uncovered a small memory leak
> > >> that isn't visible with smaller datasets? The memory use creeps up
> > >> slowly over the course of a run, which is at least consistent with a
> > >> memory leak, but, not knowing anything about the implementation, I'm
> > >> just speculating wildly here. Obviously, I could sub-sample, but this
> > >> is already a sample of a larger dataset, so I'm loath to do that if I
> > >> can avoid it.
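
For concreteness, here is a sketch (with hypothetical variable and
data-frame names, not code from this thread) of the crossed
random-intercept logit model being described. In the lme4 of this period
the fit went through lmer(..., family = binomial) with method "PQL" or
"Laplace"; in current lme4 the equivalent function is glmer():

library(lme4)

fit <- glmer(vote ~ country + party +               # plus the other covariates
               (1 | legislator) + (1 | rollcall),    # crossed random intercepts
             data   = ep_votes,                      # ~700,000 legislator-vote rows
             family = binomial)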
> > >
> > >Could you try to fit the response with a linear mixed model using the
> > >lmer2 function that is in versions 0.9975-11 and later of the lme4
> > >package? I know the model is inappropriate but I just want to get a
> > >handle on whether the mer2 representation saves enough storage to make
> > >working with such a data set and model feasible.
> > >
> > >I shouldn't speculate without actually examining the model fit myself,
> > >but I think the memory hog may be the fixed-effects model matrix.
> > >Currently that model matrix must be created as a dense matrix, and it
> > >must be created using all the rows. When you say that you have 30-50
> > >covariates (and I assume that some of them may be factors), that
> > >matrix could be the one that is breaking the bank. In lmer2 the
> > >fixed-effects model matrix is stored as a sparse matrix (although it
> > >is initially created as a dense matrix). The random-effects model
> > >matrix is created as a sparse matrix and it usually isn't the problem
> > >with memory usage.
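
As a rough, simulated illustration of that dense-versus-sparse difference
(scaled-down data, not the EP data set), sparse.model.matrix() in the
Matrix package builds the sparse model matrix directly, without first
allocating the dense version:

library(Matrix)

set.seed(2)
n <- 100000                                           # scaled down from ~700,000 rows
d <- data.frame(country = factor(sample(1:25, n, replace = TRUE)),
                party   = factor(sample(1:8,  n, replace = TRUE)))

Xdense  <- model.matrix(~ country + party, d)         # every cell stored as a double
Xsparse <- sparse.model.matrix(~ country + party, d)  # only nonzero dummies stored

object.size(Xdense)
object.size(Xsparse)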
> > >
> > >If you do succeed in fitting a linear mixed model to these data using
> > >lmer2 I would be interested in the sizes of some of the slots in the
> > >fitted model. I enclose a short transcript showing one way of
> > >checking these sizes on an S4 object.
> > >
> > >Regarding the possibility of a memory leak: I wouldn't be shocked if
> > >I had managed to create a memory leak, but the behavior that you
> > >mention is also consistent with garbage collection. At present the
> > >optimization of the deviance for generalized linear mixed models goes
> > >through the nlminb function in R, which means that the deviance
> > >evaluation must be an R function. Thus there are R objects created
> > >within the optimization that must be garbage collected. I think I
> > >know a way around this and it is on my "To Do" list to check it out
> > >but that list is pretty long these days so I can't promise anything.
> > >
> > >Thanks for writing to the list. I'll be interested in whether it is
> > >possible to work with such large data sets effectively.
> > >
> > >
> > >
> >
>
> --
> Daniel Pemstein
> Department of Political Science
> University of Illinois at Urbana-Champaign
> 702 S. Wright St.
> Urbana, IL 61801
>
> Email: dbp at uiuc.edu
>
>
>