[R-sig-ME] Computational speed - MCMCglmm/lmer

Sat Jun 19 18:50:24 CEST 2010

Thanks for making the data available, Dave.  I will be very interested
in seeing if I can make glmer more effective in fitting this model.

Right now I have a quick question, what are genders 0, 1, 2?  I
checked that if gender is consistent within subjects, which it is, so
you have 549 subjects of gender 0, 439 of gender 1 and 2 of gender 2.
If I wanted to label them as "M", "F" and "?" which labels should I
apply to genders 0 and 1 and what does gender 2 mean?

> drink.df <- within(read.delim("http://depts.washington.edu/cshrb/newweb/stats%20documents/tlfb.txt"),
+                    {
+                        id      <- factor(id)
+                        gender  <- factor(gender)
+                        weekday <- factor(weekday, labels = c("N","Y"))
+                    })
> str(drink.df) # 57K rows
'data.frame':	57318 obs. of  4 variables:
 $ id     : Factor w/ 990 levels "1014503","1017782",..: 1 1 1 2 2 2 2 2 2 2 ...
 $ gender : Factor w/ 3 levels "0","1","2": 1 1 1 2 2 2 2 2 2 2 ...
 $ weekday: Factor w/ 2 levels "N","Y": 2 1 1 2 1 1 2 2 2 2 ...
 $ drinks : int  0 0 5 8 5 10 0 0 0 0 ...
> summary(drink.df)
       id        gender    weekday       drinks
 8396125:  435   0:32445   N:16633   Min.   : 0.0000
 6551092:  180   1:24780   Y:40685   1st Qu.: 0.0000
 1028078:   90   2:   93             Median : 0.0000
 1325095:   90                       Mean   : 0.9983
 1365806:   90                       3rd Qu.: 0.0000
 1411092:   90                       Max.   :45.0000
 (Other):56343
> str(subj <- unique(subset(drink.df, select = c(id, gender))))
'data.frame':	990 obs. of  2 variables:
 $ id    : Factor w/ 990 levels "1014503","1017782",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ gender: Factor w/ 3 levels "0","1","2": 1 2 1 1 2 2 1 1 1 2 ...
> summary(subj)
       id      gender
 1014503:  1   0:549
 1017782:  1   1:439
 1018311:  1   2:  2
 1026066:  1
 1028078:  1
 1036230:  1
 (Other):984

On Sat, Jun 19, 2010 at 10:42 AM, David Atkins <datkins at u.washington.edu> wrote:
>
> Hi all--
>
> I use (g)lmer and MCMCglmm on a weekly basis, and I am wondering about
> options for speeding up their computations.  This is primarily an issue with
> MCMCglmm, given the many necessary MCMC iterations to get to convergence on
> some problems.  But, even with glmer(), I have runs that get into 20-30
> minutes.
>
> First, let me be very clear that this is in no way a criticism of Doug's and
> Jarrod's work (package developers for lme4 and MCMCglmm, respectively).
>  Their code has probably brought models/data into range that would not have
> been possible.
>
> Second, I have included link to data and script below, along with some
> timings on my computer: Mac Book Pro, 2.5GHz, with 4GB RAM.  Here's
> sessionInfo from my runs:
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] MCMCglmm_2.05      corpcor_1.5.6      ape_2.5-3
> [4] coda_0.13-5        tensorA_0.35       lme4_0.999375-34
> [7] Matrix_0.999375-41 lattice_0.19-7
>
> loaded via a namespace (and not attached):
> [1] gee_4.13-14   grid_2.11.1   nlme_3.1-96   stats4_2.11.1 tools_2.11.1
>
> Specific questions:
>
> 1. Would be curious to know timings on other people's set-ups.  Jarrod and I
> had an exchange one time where he was gracious enough to run a zero-inflated
> model where I was concerned about convergence.  He ran a model with 1.3M
> iterations, which I think would take a number of days, if not a week on my
> computer.  This was part of what got me thinking about this.  Thus, my first
> interest is whether there is an "optimal" hardware/OS configuration, or does
> it matter?
>
> Some things I see in R-help archives:
>
> 2. 32 vs. 64-bit: Seems like this is mostly an issue of data/model size and
> whether you need to access more than 4GB of RAM.  AFAICS, 64-bit processors
> are not necessarily faster.
>
> 3. "Optimized" BLAS: There's a bit of discussion about optimized BLAS (basis
> linear algebra... something).  However, these discussions note that there is
> no generally superior BLAS.  Not sure whether specific BLAS might be
> optimized for GLMM computations.
>
> 4. Parallel computing: With multi-core computers, looks like there are some
> avenues for splitting intensive computations across processors.
>
> Finally, I'm asking here b/c I run into these issues with GLMM (and
> zero-inflated mixed models), though most of the discussion I've seen thus
> far about computation speed has been on R-help.
>
> The data below are self-reported drinks (alcohol) from college students for
> up to the last 90 days.  Distribution of counts is zero-inflated.  I run a
> Poisson GLMM with glmer, over-dispersed Poisson GLMM with MCMCglmm, and then
> zero-inflated OD Poisson with MCMCglmm and provide timings for my set-up.
>
> Okay, any and all thoughts welcomed.
>
> thanks, Dave
>
>
> ### thinking through computational speed with lmer and MCMCglmm
> #
> ### read drinking data
> drink.df <- read.table(file =
> "http://depts.washington.edu/cshrb/newweb/stats%20documents/tlfb.txt",
>                                                header = TRUE, sep = "\t")
> str(drink.df) # 57K rows
>
> ### id is id variable (shocking)
> ### gender and weekday are 0/1 indicator variables
> ### drinks has number of drinks consumed
>
> ### distribution of outcome
> table(drink.df$drinks)
> plot(table(drink.df$drinks), lwd=2) # zero-inflated
>
> ### how many people?
> length(unique(drink.df$id)) # 990
> sort(table(drink.df$id))
>
> ### NOTE: most people have max of 90, which they should
> ###                     two folks with 180 and 435 (prob data errors)
> ###                     long negative tail down from 90
> #
> ### NOTE: for this purpose, not worrying about the 180/435
>
> ### speed tests
> #
> ### fit random intercept and random slope for weekday
> ### fixed effects for gender and weekday and interaction
> #
> ### Poisson GLMM with glmer()
> library(lme4)
>
> system.time(
> drk.glmer <- glmer(drinks ~ weekday*gender + (weekday | id),
>                                        data = drink.df, family = poisson,
>                                        verbose = TRUE)
> )
> summary(drk.glmer)
>
> ### timing
> #
> ###    user  system elapsed
> ###  36.326   9.013  45.316
>
> ### over-dispersed Poisson GLMM with MCMCglmm()
> library(MCMCglmm)
>
> prior <- list(R = list(V = 1, n = 1),
>                                G = list(G1 = list(V = diag(2), n = 2)))
> system.time(
> drk.mcmc <- MCMCglmm(drinks ~ weekday*gender,
>                                        random = ~ us(1 + weekday):id,
>                                        data = drink.df, family = "poisson",
>                                        prior = prior)
> )
> summary(drk.mcmc) # NOTE: using summary.MCMCglmm in v 2.05 of package
>
> ### timing
> #
> ###        user   system  elapsed
> ###     1034.317  165.128 1203.536
>
> ### zero-inflated, over-dispersed Poisson GLMM with MCMCglmm()
> #
> ### NOTE: haven't run the following yet, other than a quick "toy run" to
> ###                     sure that it is set up correctly.
> ### NOTE: this only has random intercept in each portion of the model
>
> prior2 <- list(R = list(V = diag(2), n = 1, fix = 2),
>                                G = list(G1 = list(V = 1, n = 1),
>                                                  G2 = list(V = 1, n = 1)))
> system.time(
> drk.zimcmc <- MCMCglmm(drinks ~ -1 + trait*weekday*gender,
>                                        random = ~ idh(at.level(trait, 1)):id
> + idh(at.level(trait, 2)):id,
>                                        rcov = ~ idh(trait):units,
>                                        data = drink.df, family =
> "zipoisson",
>                                        prior = prior2)
> )
> summary(drk.zimcmc)
>
> ### timing
> #
> ###      user   system  elapsed
> ###    2105.366  544.881 2640.030
>
> --
> Dave Atkins, PhD
> Research Associate Professor
> Department of Psychiatry and Behavioral Science
> University of Washington
> datkins at u.washington.edu
>
> Center for the Study of Health and Risk Behaviors (CSHRB)
> 1100 NE 45th Street, Suite 300
> Seattle, WA  98105
> 206-616-3879
> http://depts.washington.edu/cshrb/
> (Mon-Wed)
>
> Center for Healthcare Improvement, for Addictions, Mental Illness,
>  Medically Vulnerable Populations (CHAMMP)
> 325 9th Avenue, 2HH-15
> Box 359911
> Seattle, WA 98104?
> 206-897-4210
> http://www.chammp.org
> (Thurs)
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>