[R-sig-ME] LMM with Big data using binary DV

Thu Feb 9 03:28:13 CET 2012

Hi AC,

My personal preference would be glmer from the lme4 package.  I prefer
the Laplace approximation to the likelihood over the penalized
quasi-likelihood used by glmmPQL.  To give some illustrative numbers, I
simulated a dataset with 2 million observations nested within 200
groups (10,000 observations per group).  I then ran a random-intercepts
model using:

system.time(m <- glmer(Y ~ X + W + (1 | G), family = "binomial"))

where the matrices/vectors are of sizes: Y = [2 million, 1]; X = [2
million, 6]; W = [2 million, 3]; G = [2 million, 1]
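
For anyone who wants to try something similar, here is a minimal
sketch of how such data could be simulated.  It is not the exact code
I ran: the seed, the coefficient values, the random-intercept SD, and
the scaled-down group size (100 per group rather than 10,000) are all
assumptions made for illustration.

```r
library(lme4)

set.seed(1234)                        # arbitrary seed, reproducibility only
n_groups  <- 200                      # same number of groups as above
n_per_grp <- 100                      # scaled down from 10,000 per group
N <- n_groups * n_per_grp

G <- factor(rep(seq_len(n_groups), each = n_per_grp))
X <- matrix(rnorm(N * 6), ncol = 6)   # six predictors
W <- matrix(rnorm(N * 3), ncol = 3)   # three more predictors

# Hypothetical "true" values, chosen only for this sketch
u   <- rnorm(n_groups, sd = 0.5)      # group-level random intercepts
eta <- -1 + X %*% rep(0.2, 6) + W %*% rep(-0.1, 3) + u[as.integer(G)]
Y   <- rbinom(N, size = 1, prob = plogis(drop(eta)))

# Same model as above, on the smaller simulated data
system.time(m <- glmer(Y ~ X + W + (1 | G), family = binomial))
```

Raising n_per_grp towards 10,000 should give a feel for how time and
memory scale on your own machine.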

This took around 481 seconds to fit on a 1.6 GHz dual-core laptop.
With the OS and R running, my system used ~6GB of RAM for the model
and went up to ~7GB to show the summary (copies of the data are
made; this is changed in the upcoming version of lme4).
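
For a sense of scale, here is a back-of-the-envelope calculation of
the size of the raw data alone.  It assumes every column is stored as
an 8-byte double, which is a simplification (a factor G, for instance,
would be stored as integers), and the variable names are made up for
this sketch.

```r
n_obs   <- 2e6                 # observations in the simulation
n_cols  <- 1 + 6 + 3 + 1       # columns of Y, X, W and G
raw_mib <- n_obs * n_cols * 8 / 2^20   # 8 bytes per double, in MiB
raw_mib
```

That comes to about 168 MiB, so the raw data are only a small part of
the totals above.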

So as long as you have plenty of memory, you should have no trouble
modelling your data using glmer().  To make sure all your code works
first, I would use a subset of your data (say 10k observations).  Once
you are convinced you have the model you want, run it on the full data.
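
A rough sketch of that subset step, assuming Y, X, W and G are laid
out as in the call above.  The helper name fit_subset and its n_sub
argument are made up for this example; a simple random sample of rows
is only one of several ways to draw the subset.

```r
library(lme4)

# Hypothetical helper: fit the same random-intercepts model on a
# simple random sample of rows, to check that the code runs.
fit_subset <- function(Y, X, W, G, n_sub = 10000) {
  idx <- sample.int(length(Y), size = min(n_sub, length(Y)))
  Ys <- Y[idx]
  Xs <- X[idx, , drop = FALSE]
  Ws <- W[idx, , drop = FALSE]
  Gs <- droplevels(factor(G[idx]))   # drop groups absent from the sample
  glmer(Ys ~ Xs + Ws + (1 | Gs), family = binomial)
}

# Usage: m_sub <- fit_subset(Y, X, W, G, n_sub = 10000)
```

Since your utilization rates are low, a 10k sample may contain few
1s, so I would treat this fit purely as a check that the code runs.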

Cheers,

Josh

On Wed, Feb 8, 2012 at 5:28 PM, AC Del Re <acdelre at stanford.edu> wrote:
> Hi,
>
> I have a huge dataset (2.5 million patients nested within more than 100
> facilities) and would like to examine variability across facilities in
> program utilization (0=n, 1=y; utilization rates are low in general), along
> with patient and facility predictors of utilization.
>
> I have 3 questions:
>
> 1. What program and/or package(s) do you recommend for running LMMs with
> big data (even if they are not R packages)?
>
> 2. Are there any clever workarounds (e.g., random sampling of a subset of
> the data) that would allow me to use only R packages to analyze this dataset
> (assuming I would otherwise need another program due to its size)?
>
> 3. What type of LMM is recommended with a binary DV like the one I want
> to examine? I know of two potential options (the family=binomial option
> in lmer and glmmPQL in the MASS package) but am not sure which is more
> appropriate or what other R packages and functions are available for this
> purpose.
>
> Thank you,
>
> AC
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/