[R-sig-ME] LMM with Big data using binary DV

Fri Feb 10 18:36:10 CET 2012

On Fri, Feb 10, 2012 at 6:20 AM, Malcolm Fairbrother
<m.fairbrother at bristol.ac.uk> wrote:
> Dear AC (and perhaps Doug),
>
>>>> 2. Are there any clever work arounds (e.g., random sampling of subset of
>>>> data, etc) that would allow me to use only R packages to run this dataset
>>>> (assuming I need to use another program due to the size of the dataset)?
>
> You may be well aware of this, but one way of substantially speeding up the estimation of models with binary data is to use "cbind", though the feasibility of this depends on the nature of the model (number of predictors and number of unique values for each predictor variable). The kind of code you need for this is below. You'll see that the fixed effects estimates, standard errors, and random effects variances turn out the same.
>
>> If you would have an opportunity to run that model fit or a comparable
>> on lme4Eigen::glmer we would appreciate information about speed,
>> accuracy and memory usage.
>
>> As a fallback, we would appreciate the code that you used to simulate
>> the response.  We could generate something ourselves, of course, but
>> it is easier to compare when you copy someone else's simulation.
>
> I've compared the speed of lme4 versus lme4Eigen, on a simulated dataset with 100,000 observations, using a 2GHz MacBook. Based on a handful of simulations, there doesn't appear to be much difference between the two packages in terms of speed (sometimes one is faster, sometimes the other). I have reported the results of one simulation here. The two packages generate identical results for this dataset.
>
> Cheers,
> Malcolm
>
>
> N <- 100000
> grps <- 100
> dat <- data.frame(x1 = sample(1:10, N, replace=T), x2 = sample(18:23, N, replace=T), grp=rep(1:grps, each=N/grps))
> dat$y <- rbinom(N, prob = plogis(-5 + 0.1*dat$x1 + 0.2*dat$x2 + rnorm(grps)[dat$grp]), size = 1)
> failures <- by(dat, list(dat$x1, dat$x2, dat$grp), function(x) sum(x$y==0))
> successes <- by(dat, list(dat$x1, dat$x2, dat$grp), function(x) sum(x$y==1))
> dat2 <- expand.grid(x1=sort(unique(dat$x1)), x2=sort(unique(dat$x2)), grp=sort(unique(dat$grp)))
> dat2$failures <- as.vector(failures)
> dat2$successes <- as.vector(successes)
> library(lme4)
> system.time(glmer(y ~ x1 + x2 + (1 | grp), dat, family=binomial))
> #   user  system elapsed
> # 22.918   0.660  24.441
> system.time(glmer(cbind(successes, failures) ~ x1 + x2 + (1 | grp), dat2, family=binomial))
> #   user  system elapsed
> #  1.833   0.017   1.855
> detach("package:lme4")
> library(lme4Eigen)
> system.time(glmer(y ~ x1 + x2 + (1 | grp), dat, family=binomial))
> #   user  system elapsed
> # 24.824   1.811  26.773
> system.time(glmer(cbind(successes, failures) ~ x1 + x2 + (1 | grp), dat2, family=binomial))
> #   user  system elapsed
> #  1.687   0.039   1.723

Thanks for sending that, Malcolm.

Your collapsing of the binary responses to the number of successes and
failures works in this case by can't be expected to work in general.
In establishing dat2 using expand.grid you are assuming that all
combinations of covariates occur in the data.  If not you would end up
with a successes=0, failures=0 row and that might cause problems in
glm or glmer finding the proportion of successes (I havent' gone back
to look at the code to determine this).  I am trying to think of a way
of doing this using the type of strategy in the hidden function
duplicated.data.frame.  If you have the model frame and you know that
there are only two unique values for the response you can get the
counts by determining the unique combinations of covariates and using
xtabs. If you look at duplicated.data.frame, you will see that getting
the unique combinations is done by pasting the text representation of
the row using a separator that will not occur in numeric data.  The C
function do_duplicated uses hash tables and hashing a character string
is very fast.

The end result looks like the enclosed.

Of course, there is probably a much cleaner way of doing this using
Hadley Wickham's reshape package but I haven't studied that package
enough yet.

As Malcolm said, the two sets of parameter estimates are very similar
but the deviance is different.  I am working on changes that will
create the proper value of the deviance from the model in terms of
successes and failure.  It is very confusing - the function in the glm
family that produces the deviance is called "aic" and the function
called "dev.resids" actually produces the square of the deviance
residuals and their sum should be the glm deviance, except when it
isn't.  It's disheartening at best.

I also include a timing of the model fit using nAGQ=25 which should be
a very accurate evaluation of the deviance.

The other version with nAGQ=0 uses a Laplace approximation and
(approximately) profiles out the fixed-effects parameters.  In many
situations this gets close enough to the correct deviance that the
results can be used for rough model comparisons.

>> Date: Thu, 9 Feb 2012 14:13:24 -0600
>> From: Douglas Bates <bates at stat.wisc.edu>
>> To: Joshua Wiley <jwiley.psych at gmail.com>
>> Cc: AC Del Re <acdelre at stanford.edu>, r-sig-mixed-models at r-project.org
>> Subject: Re: [R-sig-ME] LMM with Big data using binary DV
>>
>> On Wed, Feb 8, 2012 at 8:28 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>>> Hi AC,
>>>
>>> My personal preference would be glmer from the lme4 package. ?I prefer
>>> the Laplace approximation for the likelihood over the quasilikelihood
>>> in glmmPQL. ?To give some exemplary numbers, I simulated a dataset
>>> with 2 million observations nested within 200 groups (10,000
>>> observations per group). ?I then ran an random intercepts model using:
>>>
>>> system.time(m <- glmer(Y ~ X + W + (1 | G), family = "binomial"))
>>>
>>> where the matrices/vectors are of sizes: Y = [2 million, 1]; X = [2
>>> million, 6]; W = [2 million, 3]; G = [2 million, 1]
>>>
>>> This took around 481 seconds to fit on a 1.6ghz dual core laptop.
>>> With the OS and R running, my system used ~ 6GB of RAM for the model
>>> and went up to ~7GB to show the summary (copies of the data are
>>> made---changed in the upcoming version of lme4).
>>>
>>> So as long as you have plenty of memory, you should have no trouble
>>> modelling your data using glmer(). ?To initially make sure all your
>>> code works, I might use a subset of your data (say 10k), once you are
>>> convinced you have the model you want, run it on the full data.
>>
>> If you would have an opportunity to run that model fit or a comparable
>> on lme4Eigen::glmer we would appreciate information about speed,
>> accuracy and memory usage.
>>
>> In lme4Eigen::glmer there are different levels of precision in the
>> approximation to the deviance being optimizer.  These are controlled
>> by the nAGQ argument to the function.  The default, nAGQ=1, uses the
>> Laplace approximation.  The special value nAGQ=0 also uses the Laplace
>> approximation but profiles out the fixed-effects parameters.  This
>> profiling is not exact but usually gets you close to the optimum that
>> you would get from nAGQ=1, but much, much faster.  In a model like
>> this you can also use nAGQ>1 and <= 25.  On the model fits we have
>> tried we don't see a lot of difference in timing between, say, nAGQ=9
>> and nAGQ=25 but on a model fit like this you might.
>>
>> As a fallback, we would appreciate the code that you used to simulate
>> the response.  We could generate something ourselves, of course, but
>> it is easier to compare when you copy someone else's simulation.
>>> On Wed, Feb 8, 2012 at 5:28 PM, AC Del Re <acdelre at stanford.edu> wrote:
>>>> Hi,
>>>>
>>>> I have a huge dataset (2.5 million patients nested within ?> 100
>>>> facilities) and would like to examine variability across facilities in
>>>> program utilization (0=n, 1=y; utilization rates are low in general), along
>>>> with patient and facility predictors of utilization.
>>>>
>>>> I have 3 questions:
>>>>
>>>> 1. What program and/or package(s) do you recommend for running LMMs with
>>>> big data (even if they are not R packages)?
>>>>
>>>> 2. Are there any clever work arounds (e.g., random sampling of subset of
>>>> data, etc) that would allow me to use only R packages to run this dataset
>>>> (assuming I need to use another program due to the size of the dataset)?
>>>>
>>>> 3. What type of LMM is recommended with a binary DV similar to the one I am
>>>> wanting to examine? I know of two potential options (family=binomial option
>>>> in lmer and the glmmPQL in the MASS package) but am not sure which is more
>>>> appropriate or what other R packages and functions are available for this
>>>> purpose?
>>>>
>>>> Thank you,
>>>>
>>>> AC
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-------------- next part --------------

R version 2.14.1 (2011-12-22)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> N <- 100000
> ngrps <- 100
> set.seed(101)
> str(dat <- within(data.frame(x1   = sample(1:10, N, replace=TRUE),
+                              x2   = sample(18:23, N, replace=TRUE),
+                              grps = factor(sample.int(ngrps, N, replace=TRUE))),
+               {
+                   y <- rbinom(N,
+                               prob = plogis(-5 + 0.1 * x1 + 0.2 * x2 + rnorm(ngrps)[grps]),
+                               size = 1)
+                   covs <- do.call(paste, c(model.frame(~ x1 + x2 + grps), sep="\r"))
+               }))
'data.frame':	100000 obs. of  5 variables:
 $ x1  : int  4 1 8 7 3 4 6 4 7 6 ...
 $ x2  : int  21 20 19 21 19 22 23 22 22 23 ...
 $ grps: Factor w/ 100 levels "1","2","3","4",..: 94 13 40 95 43 53 31 72 43 21 ...
 $ covs: chr  "4\r21\r94" "1\r20\r13" "8\r19\r40" "7\r21\r95" ...
 $ y   : num  0 0 1 0 0 1 1 0 0 1 ...
> head(dat)
  x1 x2 grps      covs y
1  4 21   94 4\r21\r94 0
2  1 20   13 1\r20\r13 0
3  8 19   40 8\r19\r40 1
4  7 21   95 7\r21\r95 0
5  3 19   43 3\r19\r43 0
6  4 22   53 4\r22\r53 1
> str(freq <- xtabs(~ covs + y, dat))
 xtabs [1:6000, 1:2] 10 11 12 5 7 1 12 5 6 6 ...
 - attr(*, "dimnames")=List of 2
  ..$ covs: chr [1:6000] "10\r18\r1" "10\r18\r10" "10\r18\r100" "10\r18\r11" ...
  ..$ y   : chr [1:2] "0" "1"
 - attr(*, "class")= chr [1:2] "xtabs" "table"
 - attr(*, "call")= language xtabs(formula = ~covs + y, data = dat)
> head(freq)
             y
covs           0 1
  10\r18\r1   10 7
  10\r18\r10  11 1
  10\r18\r100 12 3
  10\r18\r11   5 7
  10\r18\r12   7 9
  10\r18\r13   1 6
> str(dat2 <- within(dat[!duplicated(dat$covs),],
+                {
+                    failure <- freq[covs, 1]
+                    success <- freq[covs, 2]
+                    y       <- NULL
+                    covs    <- NULL
+                }))
'data.frame':	6000 obs. of  5 variables:
 $ x1     : int  4 1 8 7 3 4 6 4 7 6 ...
 $ x2     : int  21 20 19 21 19 22 23 22 22 23 ...
 $ grps   : Factor w/ 100 levels "1","2","3","4",..: 94 13 40 95 43 53 31 72 43 21 ...
 $ success: num  4 8 12 6 4 8 11 9 4 12 ...
 $ failure: num  10 9 10 13 19 10 3 10 15 5 ...
> head(dat2)
  x1 x2 grps success failure
1  4 21   94       4      10
2  1 20   13       8       9
3  8 19   40      12      10
4  7 21   95       6      13
5  3 19   43       4      19
6  4 22   53       8      10
> xtabs(~ y, subset(dat, covs == "4\r21\r94"))  # corresponds to first row in dat2
y
 0  1 
10  4 
> 
> library(lme4Eigen)
Loading required package: lattice
> ## using the full data frame, dat
> 
> ## default fit using the Laplace approximation
> system.time(print(gm01a <- glmer(y ~ x1 + x2 + (1 | grps), dat, binomial), corr=FALSE))
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Formula: y ~ x1 + x2 + (1 | grps) 
   Data: dat 

      AIC       BIC    logLik  deviance 
113146.67 113184.73 -56569.34 113138.67 

Random effects:
 Groups Name        Variance Std.Dev.
 grps   (Intercept) 1.073    1.036   
Number of obs: 100000, groups: grps, 100

Fixed effects:
             Estimate Std. Error z value
(Intercept) -5.039837   0.137784  -36.58
x1           0.102856   0.002550   40.33
x2           0.195406   0.004307   45.37
   user  system elapsed 
 32.250   0.084  32.327 
> 
> ## faster but less accurate version that profiles out the fixed-effects parameters
> system.time(print(gm01b <- glmer(y ~ x1 + x2 + (1 | grps), dat, binomial, nAGQ=0L), corr=FALSE))
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Formula: y ~ x1 + x2 + (1 | grps) 
   Data: dat 

      AIC       BIC    logLik  deviance 
113146.68 113184.74 -56569.34 113138.68 

Random effects:
 Groups Name        Variance Std.Dev.
 grps   (Intercept) 1.077    1.038   
Number of obs: 100000, groups: grps, 100

Fixed effects:
             Estimate Std. Error z value
(Intercept) -5.034540   0.137930  -36.50
x1           0.102753   0.002550   40.30
x2           0.195209   0.004306   45.33
   user  system elapsed 
  4.824   0.020   4.841 
> 
> ## slowest but most accurate version
> system.time(print(gm01c <- glmer(y ~ x1 + x2 + (1 | grps), dat, binomial, nAGQ=25L), corr=FALSE))
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Formula: y ~ x1 + x2 + (1 | grps) 
   Data: dat 

     AIC      BIC   logLik deviance 
113146.6 113184.7 -56569.3 113138.6 

Random effects:
 Groups Name        Variance Std.Dev.
 grps   (Intercept) 1.073    1.036   
Number of obs: 100000, groups: grps, 100

Fixed effects:
             Estimate Std. Error z value
(Intercept) -5.039829   0.137783  -36.58
x1           0.102856   0.002550   40.33
x2           0.195406   0.004307   45.37
   user  system elapsed 
 81.589   0.088  81.726 
> 
> ## using the reduced data frame, dat2
> 
> ## default fit using the Laplace approximation
> system.time(print(gmsda <- glmer(cbind(success,failure) ~ x1 + x2 + (1 | grps), dat2, binomial),
+                   corr=FALSE))
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Formula: cbind(success, failure) ~ x1 + x2 + (1 | grps) 
   Data: dat2 

      AIC       BIC    logLik  deviance 
 6890.896  6917.694 -3441.448  6882.896 

Random effects:
 Groups Name        Variance Std.Dev.
 grps   (Intercept) 1.073    1.036   
Number of obs: 6000, groups: grps, 100

Fixed effects:
             Estimate Std. Error z value
(Intercept) -5.039823   0.137783  -36.58
x1           0.102857   0.002550   40.33
x2           0.195406   0.004307   45.37
   user  system elapsed 
  1.508   0.012   1.513 
> ## fastest version profiling out the fixed-effects parameters
> system.time(print(gmsdb <- glmer(cbind(success,failure) ~ x1 + x2 + (1 | grps), dat2,
+                                  binomial, nAGQ=0L), corr=FALSE))
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Formula: cbind(success, failure) ~ x1 + x2 + (1 | grps) 
   Data: dat2 

      AIC       BIC    logLik  deviance 
 6890.901  6917.699 -3441.450  6882.901 

Random effects:
 Groups Name        Variance Std.Dev.
 grps   (Intercept) 1.073    1.036   
Number of obs: 6000, groups: grps, 100

Fixed effects:
             Estimate Std. Error z value
(Intercept) -5.034500   0.137780  -36.54
x1           0.102752   0.002550   40.30
x2           0.195207   0.004306   45.33
   user  system elapsed 
  0.220   0.008   0.223 
> ## slower but most accurate version
> system.time(print(gmsd25 <- glmer(cbind(success,failure) ~ x1 + x2 + (1 | grps), dat2,
+                                   binomial, nAGQ=25L), corr=FALSE))
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Formula: cbind(success, failure) ~ x1 + x2 + (1 | grps) 
   Data: dat2 

      AIC       BIC    logLik  deviance 
 6890.827  6917.625 -3441.414  6882.827 

Random effects:
 Groups Name        Variance Std.Dev.
 grps   (Intercept) 1.073    1.036   
Number of obs: 6000, groups: grps, 100

Fixed effects:
             Estimate Std. Error z value
(Intercept) -5.039829   0.137783  -36.58
x1           0.102857   0.002550   40.33
x2           0.195406   0.004307   45.37
   user  system elapsed 
  6.113   0.004   6.112 
> 
> proc.time()
   user  system elapsed 
131.684   1.040 131.792