[R-sig-ME] Computational speed - MCMCglmm/lmer
Douglas Bates
bates at stat.wisc.edu
Sat Jun 19 18:50:24 CEST 2010
Thanks for making the data available, Dave. I will be very interested
in seeing if I can make glmer more effective in fitting this model.
Right now I have a quick question, what are genders 0, 1, 2? I
checked that if gender is consistent within subjects, which it is, so
you have 549 subjects of gender 0, 439 of gender 1 and 2 of gender 2.
If I wanted to label them as "M", "F" and "?" which labels should I
apply to genders 0 and 1 and what does gender 2 mean?
> drink.df <- within(read.delim("http://depts.washington.edu/cshrb/newweb/stats%20documents/tlfb.txt"),
+ {
+ id <- factor(id)
+ gender <- factor(gender)
+ weekday <- factor(weekday, labels = c("N","Y"))
+ })
> str(drink.df) # 57K rows
'data.frame': 57318 obs. of 4 variables:
$ id : Factor w/ 990 levels "1014503","1017782",..: 1 1 1 2 2 2 2 2 2 2 ...
$ gender : Factor w/ 3 levels "0","1","2": 1 1 1 2 2 2 2 2 2 2 ...
$ weekday: Factor w/ 2 levels "N","Y": 2 1 1 2 1 1 2 2 2 2 ...
$ drinks : int 0 0 5 8 5 10 0 0 0 0 ...
> summary(drink.df)
id gender weekday drinks
8396125: 435 0:32445 N:16633 Min. : 0.0000
6551092: 180 1:24780 Y:40685 1st Qu.: 0.0000
1028078: 90 2: 93 Median : 0.0000
1325095: 90 Mean : 0.9983
1365806: 90 3rd Qu.: 0.0000
1411092: 90 Max. :45.0000
(Other):56343
> str(subj <- unique(subset(drink.df, select = c(id, gender))))
'data.frame': 990 obs. of 2 variables:
$ id : Factor w/ 990 levels "1014503","1017782",..: 1 2 3 4 5 6 7 8 9 10 ...
$ gender: Factor w/ 3 levels "0","1","2": 1 2 1 1 2 2 1 1 1 2 ...
> summary(subj)
id gender
1014503: 1 0:549
1017782: 1 1:439
1018311: 1 2: 2
1026066: 1
1028078: 1
1036230: 1
(Other):984
On Sat, Jun 19, 2010 at 10:42 AM, David Atkins <datkins at u.washington.edu> wrote:
>
> Hi all--
>
> I use (g)lmer and MCMCglmm on a weekly basis, and I am wondering about
> options for speeding up their computations. This is primarily an issue with
> MCMCglmm, given the many necessary MCMC iterations to get to convergence on
> some problems. But, even with glmer(), I have runs that get into 20-30
> minutes.
>
> First, let me be very clear that this is in no way a criticism of Doug's and
> Jarrod's work (package developers for lme4 and MCMCglmm, respectively).
> Their code has probably brought models/data into range that would not have
> been possible.
>
> Second, I have included link to data and script below, along with some
> timings on my computer: Mac Book Pro, 2.5GHz, with 4GB RAM. Here's
> sessionInfo from my runs:
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] MCMCglmm_2.05 corpcor_1.5.6 ape_2.5-3
> [4] coda_0.13-5 tensorA_0.35 lme4_0.999375-34
> [7] Matrix_0.999375-41 lattice_0.19-7
>
> loaded via a namespace (and not attached):
> [1] gee_4.13-14 grid_2.11.1 nlme_3.1-96 stats4_2.11.1 tools_2.11.1
>
> Specific questions:
>
> 1. Would be curious to know timings on other people's set-ups. Jarrod and I
> had an exchange one time where he was gracious enough to run a zero-inflated
> model where I was concerned about convergence. He ran a model with 1.3M
> iterations, which I think would take a number of days, if not a week on my
> computer. This was part of what got me thinking about this. Thus, my first
> interest is whether there is an "optimal" hardware/OS configuration, or does
> it matter?
>
> Some things I see in R-help archives:
>
> 2. 32 vs. 64-bit: Seems like this is mostly an issue of data/model size and
> whether you need to access more than 4GB of RAM. AFAICS, 64-bit processors
> are not necessarily faster.
>
> 3. "Optimized" BLAS: There's a bit of discussion about optimized BLAS (basis
> linear algebra... something). However, these discussions note that there is
> no generally superior BLAS. Not sure whether specific BLAS might be
> optimized for GLMM computations.
>
> 4. Parallel computing: With multi-core computers, looks like there are some
> avenues for splitting intensive computations across processors.
>
> Finally, I'm asking here b/c I run into these issues with GLMM (and
> zero-inflated mixed models), though most of the discussion I've seen thus
> far about computation speed has been on R-help.
>
> The data below are self-reported drinks (alcohol) from college students for
> up to the last 90 days. Distribution of counts is zero-inflated. I run a
> Poisson GLMM with glmer, over-dispersed Poisson GLMM with MCMCglmm, and then
> zero-inflated OD Poisson with MCMCglmm and provide timings for my set-up.
>
> Okay, any and all thoughts welcomed.
>
> thanks, Dave
>
>
> ### thinking through computational speed with lmer and MCMCglmm
> #
> ### read drinking data
> drink.df <- read.table(file =
> "http://depts.washington.edu/cshrb/newweb/stats%20documents/tlfb.txt",
> header = TRUE, sep = "\t")
> str(drink.df) # 57K rows
>
> ### id is id variable (shocking)
> ### gender and weekday are 0/1 indicator variables
> ### drinks has number of drinks consumed
>
> ### distribution of outcome
> table(drink.df$drinks)
> plot(table(drink.df$drinks), lwd=2) # zero-inflated
>
> ### how many people?
> length(unique(drink.df$id)) # 990
> sort(table(drink.df$id))
>
> ### NOTE: most people have max of 90, which they should
> ### two folks with 180 and 435 (prob data errors)
> ### long negative tail down from 90
> #
> ### NOTE: for this purpose, not worrying about the 180/435
>
> ### speed tests
> #
> ### fit random intercept and random slope for weekday
> ### fixed effects for gender and weekday and interaction
> #
> ### Poisson GLMM with glmer()
> library(lme4)
>
> system.time(
> drk.glmer <- glmer(drinks ~ weekday*gender + (weekday | id),
> data = drink.df, family = poisson,
> verbose = TRUE)
> )
> summary(drk.glmer)
>
> ### timing
> #
> ### user system elapsed
> ### 36.326 9.013 45.316
>
> ### over-dispersed Poisson GLMM with MCMCglmm()
> library(MCMCglmm)
>
> prior <- list(R = list(V = 1, n = 1),
> G = list(G1 = list(V = diag(2), n = 2)))
> system.time(
> drk.mcmc <- MCMCglmm(drinks ~ weekday*gender,
> random = ~ us(1 + weekday):id,
> data = drink.df, family = "poisson",
> prior = prior)
> )
> summary(drk.mcmc) # NOTE: using summary.MCMCglmm in v 2.05 of package
>
> ### timing
> #
> ### user system elapsed
> ### 1034.317 165.128 1203.536
>
> ### zero-inflated, over-dispersed Poisson GLMM with MCMCglmm()
> #
> ### NOTE: haven't run the following yet, other than a quick "toy run" to
> ### sure that it is set up correctly.
> ### NOTE: this only has random intercept in each portion of the model
>
> prior2 <- list(R = list(V = diag(2), n = 1, fix = 2),
> G = list(G1 = list(V = 1, n = 1),
> G2 = list(V = 1, n = 1)))
> system.time(
> drk.zimcmc <- MCMCglmm(drinks ~ -1 + trait*weekday*gender,
> random = ~ idh(at.level(trait, 1)):id
> + idh(at.level(trait, 2)):id,
> rcov = ~ idh(trait):units,
> data = drink.df, family =
> "zipoisson",
> prior = prior2)
> )
> summary(drk.zimcmc)
>
> ### timing
> #
> ### user system elapsed
> ### 2105.366 544.881 2640.030
>
> --
> Dave Atkins, PhD
> Research Associate Professor
> Department of Psychiatry and Behavioral Science
> University of Washington
> datkins at u.washington.edu
>
> Center for the Study of Health and Risk Behaviors (CSHRB)
> 1100 NE 45th Street, Suite 300
> Seattle, WA 98105
> 206-616-3879
> http://depts.washington.edu/cshrb/
> (Mon-Wed)
>
> Center for Healthcare Improvement, for Addictions, Mental Illness,
> Medically Vulnerable Populations (CHAMMP)
> 325 9th Avenue, 2HH-15
> Box 359911
> Seattle, WA 98104?
> 206-897-4210
> http://www.chammp.org
> (Thurs)
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
More information about the R-sig-mixed-models
mailing list