[R-sig-ME] LMM with Big data using binary DV

Joshua Wiley jwiley.psych at gmail.com
Sat Feb 18 22:56:01 CET 2012


Apologies for the delay.  I had not saved the process to simulate the
data.  I created a new simulation and spent several days trying to get
it working on my laptop, without much success.  My first simulation
appears to have been something of a fluke in terms of speed.  The
present simulation used a data set of 200 groups with 5,000
replications each, 6 'observation level' predictors, and 3 'group
level' predictors.  I backed down from 2 million to 1 million
observations because the fit still took over an hour to run and I was
tired of waiting.  Curiously, lme4Eigen::glmer reports Inf and -Inf
deviance and log-likelihood values.  The parameter estimates from all
three versions are similar, though not identical.  With nAGQ = 0, the
log-likelihood and deviance from lme4Eigen and lme4 are similar.  Full
code, timings, and commented output from my runs are in the script (a
rough sketch of the simulation is included after the timings below),
but the synopsis is:

lme4::glmer
## > system.time(m1 <- glmer(dat$Y ~ dat$X + dat$W + (1 | dat$G),
## +     family = "binomial"))
##    user  system elapsed
## 4158.10  117.19 4313.85

lme4Eigen::glmer
## > system.time(m1 <- glmer(dat$Y ~ dat$X + dat$W + (1 | dat$G),
## +     family = "binomial"))
##    user  system elapsed
##  129.03    9.67  140.62
## infinite deviance, otherwise in the same ballpark as the other two

lme4Eigen::glmer with nAGQ = 0
## > system.time(mfast <- glmer(dat$Y ~ dat$X + dat$W + (1 | dat$G),
## +     family = "binomial", nAGQ = 0))
##    user  system elapsed
##  128.51    9.59  139.61
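
As mentioned above, here is a rough sketch of a simulation with the
same shape (200 groups x 5,000 observations, 6 observation-level and
3 group-level predictors, binary response); the coefficient values and
variable names are illustrative, not the exact ones from my script:

## illustrative simulation only; coefficient values are made up
set.seed(10)
ngroups <- 200
nper    <- 5000
n       <- ngroups * nper

G <- factor(rep(seq_len(ngroups), each = nper))  # grouping factor
X <- matrix(rnorm(n * 6), ncol = 6)              # 6 observation-level predictors
W <- matrix(rnorm(ngroups * 3), ncol = 3)[G, ]   # 3 group-level predictors, expanded to n rows
u <- rnorm(ngroups, sd = 1)[G]                   # random intercepts

eta <- -2 + X %*% rep(0.2, 6) + W %*% rep(0.3, 3) + u  # linear predictor
Y   <- rbinom(n, size = 1, prob = plogis(eta))         # binary DV
dat <- list(Y = Y, X = X, W = W, G = G)

With dat stored as a list of matrices, the calls shown above
(dat$Y ~ dat$X + dat$W + (1 | dat$G)) run unchanged.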


System characteristics:

R Under development (unstable) (2012-02-03 r58258)
Platform: x86_64-pc-mingw32/x64 (64-bit)
lme4Eigen_0.9996875-9
lme4_0.999375-42

Windows 7 x64; 6 GB memory @ 1066 MHz; Intel Core i7 920 @ 2.66 GHz



On Thu, Feb 9, 2012 at 12:13 PM, Douglas Bates <bates at stat.wisc.edu> wrote:
> On Wed, Feb 8, 2012 at 8:28 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>> Hi AC,
>>
>> My personal preference would be glmer from the lme4 package.  I prefer
>> the Laplace approximation for the likelihood over the quasi-likelihood
>> in glmmPQL.  To give some example numbers, I simulated a dataset
>> with 2 million observations nested within 200 groups (10,000
>> observations per group).  I then ran a random intercepts model using:
>>
>> system.time(m <- glmer(Y ~ X + W + (1 | G), family = "binomial"))
>>
>> where the matrices/vectors are of sizes: Y = [2 million, 1]; X = [2
>> million, 6]; W = [2 million, 3]; G = [2 million, 1]
>>
>> This took around 481 seconds to fit on a 1.6 GHz dual-core laptop.
>> With the OS and R running, my system used ~6 GB of RAM for the model
>> and went up to ~7 GB to show the summary (copies of the data are
>> made; this is changed in the upcoming version of lme4).
>>
>> So as long as you have plenty of memory, you should have no trouble
>> modelling your data using glmer().  To make sure all your code works,
>> I might first use a subset of your data (say 10k observations); once
>> you are convinced you have the model you want, run it on the full
>> data.
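
A minimal sketch of that pilot-then-full-fit approach (the object names
full, for the complete data frame, and G, for the facility identifier,
are hypothetical): sample a few whole facilities so the grouping
structure is preserved, fit on that subset, then rerun on everything.

## hypothetical names: full = complete data frame, G = facility ID
set.seed(1)
keep  <- sample(levels(full$G), 20)             # a handful of whole facilities
small <- droplevels(subset(full, G %in% keep))  # pilot subset

m_pilot <- glmer(Y ~ X + W + (1 | G), data = small, family = "binomial")
## once this looks right, refit with data = full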
>
> If you have an opportunity to run that model fit, or a comparable one,
> with lme4Eigen::glmer, we would appreciate information about speed,
> accuracy, and memory usage.
>
> In lme4Eigen::glmer there are different levels of precision in the
> approximation to the deviance being optimized.  These are controlled
> by the nAGQ argument to the function.  The default, nAGQ = 1, uses the
> Laplace approximation.  The special value nAGQ = 0 also uses the
> Laplace approximation but profiles out the fixed-effects parameters.
> This profiling is not exact, but it usually gets you close to the
> optimum you would get from nAGQ = 1 much, much faster.  In a model
> like this you can also use nAGQ > 1 and <= 25.  On the model fits we
> have tried we don't see a lot of difference in timing between, say,
> nAGQ = 9 and nAGQ = 25, but on a model fit like this you might.
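
A minimal sketch of those nAGQ settings applied to the simulated data
above (the model has a single scalar random effect, so adaptive
Gauss-Hermite quadrature with nAGQ > 1 is available); fits and timings
will of course differ by machine:

library(lme4Eigen)

## Laplace approximation with the fixed effects profiled out (fastest)
m0 <- glmer(dat$Y ~ dat$X + dat$W + (1 | dat$G),
            family = "binomial", nAGQ = 0)

## default: Laplace approximation (nAGQ = 1)
m1 <- glmer(dat$Y ~ dat$X + dat$W + (1 | dat$G),
            family = "binomial")

## adaptive Gauss-Hermite quadrature, 9 nodes (slower, usually more accurate)
m9 <- glmer(dat$Y ~ dat$X + dat$W + (1 | dat$G),
            family = "binomial", nAGQ = 9)

## compare fixed-effect estimates across the three approximations
cbind(nAGQ0 = fixef(m0), nAGQ1 = fixef(m1), nAGQ9 = fixef(m9))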
>
> As a fallback, we would appreciate the code that you used to simulate
> the response.  We could generate something ourselves, of course, but
> it is easier to compare when you copy someone else's simulation.
>> On Wed, Feb 8, 2012 at 5:28 PM, AC Del Re <acdelre at stanford.edu> wrote:
>>> Hi,
>>>
>>> I have a huge dataset (2.5 million patients nested within >100
>>> facilities) and would like to examine variability across facilities in
>>> program utilization (0 = no, 1 = yes; utilization rates are low in
>>> general), along with patient and facility predictors of utilization.
>>>
>>> I have 3 questions:
>>>
>>> 1. What program and/or package(s) do you recommend for running LMMs with
>>> big data (even if they are not R packages)?
>>>
>>> 2. Are there any clever workarounds (e.g., random sampling of a subset
>>> of the data) that would allow me to use only R packages to run this
>>> dataset (assuming I would otherwise need another program due to the
>>> size of the dataset)?
>>>
>>> 3. What type of LMM is recommended for a binary DV like the one I want
>>> to examine? I know of two potential options (the family = binomial
>>> option in lmer, and glmmPQL in the MASS package) but am not sure which
>>> is more appropriate, or what other R packages and functions are
>>> available for this purpose.
>>>
>>> Thank you,
>>>
>>> AC
>>>

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

