[R-sig-ME] LMM with Big data using binary DV

Douglas Bates bates at stat.wisc.edu
Thu Feb 9 21:13:24 CET 2012


On Wed, Feb 8, 2012 at 8:28 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
> Hi AC,
>
> My personal preference would be glmer from the lme4 package.  I prefer
> the Laplace approximation for the likelihood over the quasilikelihood
> in glmmPQL.  To give some exemplary numbers, I simulated a dataset
> with 2 million observations nested within 200 groups (10,000
> observations per group).  I then ran a random intercepts model using:
>
> system.time(m <- glmer(Y ~ X + W + (1 | G), family = "binomial"))
>
> where the matrices/vectors are of sizes: Y = [2 million, 1]; X = [2
> million, 6]; W = [2 million, 3]; G = [2 million, 1]
>
> This took around 481 seconds to fit on a 1.6 GHz dual-core laptop.
> With the OS and R running, my system used ~6 GB of RAM for the model
> and went up to ~7 GB to show the summary (copies of the data are
> made; this is changed in the upcoming version of lme4).
>
> So as long as you have plenty of memory, you should have no trouble
> modelling your data using glmer().  To make sure all your code works,
> I might first use a subset of your data (say 10k rows); once you are
> convinced you have the model you want, run it on the full data.
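
A minimal sketch of that subset-first workflow (the data frame and
variable names here are only illustrative, not from the original post):

library(lme4)

## fit on a 10,000-row subset first to check that the code and model run
dat_small <- dat[sample(nrow(dat), 10000), ]
m_small <- glmer(y ~ x1 + x2 + (1 | facility),
                 data = dat_small, family = binomial)

## once the specification is settled, refit on the full data
m_full <- update(m_small, data = dat)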

If you have an opportunity to run that model fit, or a comparable one,
with lme4Eigen::glmer, we would appreciate information about speed,
accuracy and memory usage.

In lme4Eigen::glmer there are different levels of precision in the
approximation to the deviance being optimized.  These are controlled
by the nAGQ argument to the function.  The default, nAGQ=1, uses the
Laplace approximation.  The special value nAGQ=0 also uses the Laplace
approximation but profiles out the fixed-effects parameters.  This
profiling is not exact, but it usually gets you close to the optimum
you would get from nAGQ=1, and much, much faster.  In a model like
this you can also use nAGQ > 1 (up to 25).  On the model fits we have
tried we don't see a lot of difference in timing between, say, nAGQ=9
and nAGQ=25, but on a model fit like this you might.
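
For instance, one could compare the levels along these lines (a sketch
only; it assumes lme4Eigen::glmer takes the same formula, family and
nAGQ arguments described above, and that Y, X, W and G from the earlier
simulation are in the workspace):

library(lme4Eigen)

## nAGQ = 0: Laplace approximation with the fixed effects profiled out (fastest)
system.time(m0 <- glmer(Y ~ X + W + (1 | G), family = binomial, nAGQ = 0))

## nAGQ = 1 (the default): Laplace approximation
system.time(m1 <- glmer(Y ~ X + W + (1 | G), family = binomial, nAGQ = 1))

## nAGQ = 9: adaptive Gauss-Hermite quadrature with 9 points
## (only available with a single scalar random effect, as here)
system.time(m9 <- glmer(Y ~ X + W + (1 | G), family = binomial, nAGQ = 9))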

As a fallback, we would appreciate the code that you used to simulate
the response.  We could generate something ourselves, of course, but
it is easier to compare when you copy someone else's simulation.
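
Something along these lines would do (a rough sketch only; the effect
sizes and names are invented for illustration and this is not the
original simulation):

library(lme4)
set.seed(1)

ngrp <- 200                      # number of groups
nper <- 10000                    # observations per group
n    <- ngrp * nper              # 2 million observations in total

g   <- rep(seq_len(ngrp), each = nper)         # group index
G   <- factor(g)                               # grouping factor
x   <- rnorm(n)                                # one continuous predictor
u   <- rnorm(ngrp, sd = 1)                     # random intercepts by group
eta <- -2 + 0.5 * x + u[g]                     # linear predictor (low event rate)
Y   <- rbinom(n, size = 1, prob = plogis(eta)) # binary response

system.time(m <- glmer(Y ~ x + (1 | G), family = binomial))
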
> On Wed, Feb 8, 2012 at 5:28 PM, AC Del Re <acdelre at stanford.edu> wrote:
>> Hi,
>>
>> I have a huge dataset (2.5 million patients nested within > 100
>> facilities) and would like to examine variability across facilities in
>> program utilization (0 = no, 1 = yes; utilization rates are low in
>> general), along with patient and facility predictors of utilization.
>>
>> I have 3 questions:
>>
>> 1. What program and/or package(s) do you recommend for running LMMs with
>> big data (even if they are not R packages)?
>>
>> 2. Are there any clever workarounds (e.g., random sampling of a subset
>> of the data) that would allow me to use only R packages for this dataset
>> (assuming its size would otherwise require another program)?
>>
>> 3. What type of LMM is recommended for a binary DV like the one I want
>> to examine? I know of two potential options (the family = binomial option
>> in lmer and glmmPQL in the MASS package) but am not sure which is more
>> appropriate, or what other R packages and functions are available for
>> this purpose.
>>
>> Thank you,
>>
>> AC
>>
>
>
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/
>



