[R-sig-ME] Computational speed - MCMCglmm/lmer
David Atkins
datkins at u.washington.edu
Sun Jun 20 02:00:11 CEST 2010
Douglas Bates wrote:
> Thanks for making the data available, Dave. I will be very interested
> in seeing if I can make glmer more effective in fitting this model.
>
> Right now I have a quick question, what are genders 0, 1, 2? I
> checked that if gender is consistent within subjects, which it is, so
> you have 549 subjects of gender 0, 439 of gender 1 and 2 of gender 2.
> If I wanted to label them as "M", "F" and "?" which labels should I
> apply to genders 0 and 1 and what does gender 2 mean?
Whoops, thought this data was a bit cleaner that apparently it is.
For gender, 0 = Female, 1 = Male
(Men drink more, so pretty easy to confirm...)
I would need to double-check on the 2s, though I do believe there is a
category for transgendered on the demographics.
I'll try to get a cleaner dataset up soon.
D
>
>> drink.df <- within(read.delim("http://depts.washington.edu/cshrb/newweb/stats%20documents/tlfb.txt"),
> + {
> + id <- factor(id)
> + gender <- factor(gender)
> + weekday <- factor(weekday, labels = c("N","Y"))
> + })
>> str(drink.df) # 57K rows
> 'data.frame': 57318 obs. of 4 variables:
> $ id : Factor w/ 990 levels "1014503","1017782",..: 1 1 1 2 2 2 2 2 2 2 ...
> $ gender : Factor w/ 3 levels "0","1","2": 1 1 1 2 2 2 2 2 2 2 ...
> $ weekday: Factor w/ 2 levels "N","Y": 2 1 1 2 1 1 2 2 2 2 ...
> $ drinks : int 0 0 5 8 5 10 0 0 0 0 ...
>> summary(drink.df)
> id gender weekday drinks
> 8396125: 435 0:32445 N:16633 Min. : 0.0000
> 6551092: 180 1:24780 Y:40685 1st Qu.: 0.0000
> 1028078: 90 2: 93 Median : 0.0000
> 1325095: 90 Mean : 0.9983
> 1365806: 90 3rd Qu.: 0.0000
> 1411092: 90 Max. :45.0000
> (Other):56343
>> str(subj <- unique(subset(drink.df, select = c(id, gender))))
> 'data.frame': 990 obs. of 2 variables:
> $ id : Factor w/ 990 levels "1014503","1017782",..: 1 2 3 4 5 6 7 8 9 10 ...
> $ gender: Factor w/ 3 levels "0","1","2": 1 2 1 1 2 2 1 1 1 2 ...
>> summary(subj)
> id gender
> 1014503: 1 0:549
> 1017782: 1 1:439
> 1018311: 1 2: 2
> 1026066: 1
> 1028078: 1
> 1036230: 1
> (Other):984
>
>
> On Sat, Jun 19, 2010 at 10:42 AM, David Atkins <datkins at u.washington.edu> wrote:
>> Hi all--
>>
>> I use (g)lmer and MCMCglmm on a weekly basis, and I am wondering about
>> options for speeding up their computations. This is primarily an issue with
>> MCMCglmm, given the many necessary MCMC iterations to get to convergence on
>> some problems. But, even with glmer(), I have runs that get into 20-30
>> minutes.
>>
>> First, let me be very clear that this is in no way a criticism of Doug's and
>> Jarrod's work (package developers for lme4 and MCMCglmm, respectively).
>> Their code has probably brought models/data into range that would not have
>> been possible.
>>
>> Second, I have included link to data and script below, along with some
>> timings on my computer: Mac Book Pro, 2.5GHz, with 4GB RAM. Here's
>> sessionInfo from my runs:
>>
>>> sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] MCMCglmm_2.05 corpcor_1.5.6 ape_2.5-3
>> [4] coda_0.13-5 tensorA_0.35 lme4_0.999375-34
>> [7] Matrix_0.999375-41 lattice_0.19-7
>>
>> loaded via a namespace (and not attached):
>> [1] gee_4.13-14 grid_2.11.1 nlme_3.1-96 stats4_2.11.1 tools_2.11.1
>>
>> Specific questions:
>>
>> 1. Would be curious to know timings on other people's set-ups. Jarrod and I
>> had an exchange one time where he was gracious enough to run a zero-inflated
>> model where I was concerned about convergence. He ran a model with 1.3M
>> iterations, which I think would take a number of days, if not a week on my
>> computer. This was part of what got me thinking about this. Thus, my first
>> interest is whether there is an "optimal" hardware/OS configuration, or does
>> it matter?
>>
>> Some things I see in R-help archives:
>>
>> 2. 32 vs. 64-bit: Seems like this is mostly an issue of data/model size and
>> whether you need to access more than 4GB of RAM. AFAICS, 64-bit processors
>> are not necessarily faster.
>>
>> 3. "Optimized" BLAS: There's a bit of discussion about optimized BLAS (basis
>> linear algebra... something). However, these discussions note that there is
>> no generally superior BLAS. Not sure whether specific BLAS might be
>> optimized for GLMM computations.
>>
>> 4. Parallel computing: With multi-core computers, looks like there are some
>> avenues for splitting intensive computations across processors.
>>
>> Finally, I'm asking here b/c I run into these issues with GLMM (and
>> zero-inflated mixed models), though most of the discussion I've seen thus
>> far about computation speed has been on R-help.
>>
>> The data below are self-reported drinks (alcohol) from college students for
>> up to the last 90 days. Distribution of counts is zero-inflated. I run a
>> Poisson GLMM with glmer, over-dispersed Poisson GLMM with MCMCglmm, and then
>> zero-inflated OD Poisson with MCMCglmm and provide timings for my set-up.
>>
>> Okay, any and all thoughts welcomed.
>>
>> thanks, Dave
>>
>>
>> ### thinking through computational speed with lmer and MCMCglmm
>> #
>> ### read drinking data
>> drink.df <- read.table(file =
>> "http://depts.washington.edu/cshrb/newweb/stats%20documents/tlfb.txt",
>> header = TRUE, sep = "\t")
>> str(drink.df) # 57K rows
>>
>> ### id is id variable (shocking)
>> ### gender and weekday are 0/1 indicator variables
>> ### drinks has number of drinks consumed
>>
>> ### distribution of outcome
>> table(drink.df$drinks)
>> plot(table(drink.df$drinks), lwd=2) # zero-inflated
>>
>> ### how many people?
>> length(unique(drink.df$id)) # 990
>> sort(table(drink.df$id))
>>
>> ### NOTE: most people have max of 90, which they should
>> ### two folks with 180 and 435 (prob data errors)
>> ### long negative tail down from 90
>> #
>> ### NOTE: for this purpose, not worrying about the 180/435
>>
>> ### speed tests
>> #
>> ### fit random intercept and random slope for weekday
>> ### fixed effects for gender and weekday and interaction
>> #
>> ### Poisson GLMM with glmer()
>> library(lme4)
>>
>> system.time(
>> drk.glmer <- glmer(drinks ~ weekday*gender + (weekday | id),
>> data = drink.df, family = poisson,
>> verbose = TRUE)
>> )
>> summary(drk.glmer)
>>
>> ### timing
>> #
>> ### user system elapsed
>> ### 36.326 9.013 45.316
>>
>> ### over-dispersed Poisson GLMM with MCMCglmm()
>> library(MCMCglmm)
>>
>> prior <- list(R = list(V = 1, n = 1),
>> G = list(G1 = list(V = diag(2), n = 2)))
>> system.time(
>> drk.mcmc <- MCMCglmm(drinks ~ weekday*gender,
>> random = ~ us(1 + weekday):id,
>> data = drink.df, family = "poisson",
>> prior = prior)
>> )
>> summary(drk.mcmc) # NOTE: using summary.MCMCglmm in v 2.05 of package
>>
>> ### timing
>> #
>> ### user system elapsed
>> ### 1034.317 165.128 1203.536
>>
>> ### zero-inflated, over-dispersed Poisson GLMM with MCMCglmm()
>> #
>> ### NOTE: haven't run the following yet, other than a quick "toy run" to
>> ### sure that it is set up correctly.
>> ### NOTE: this only has random intercept in each portion of the model
>>
>> prior2 <- list(R = list(V = diag(2), n = 1, fix = 2),
>> G = list(G1 = list(V = 1, n = 1),
>> G2 = list(V = 1, n = 1)))
>> system.time(
>> drk.zimcmc <- MCMCglmm(drinks ~ -1 + trait*weekday*gender,
>> random = ~ idh(at.level(trait, 1)):id
>> + idh(at.level(trait, 2)):id,
>> rcov = ~ idh(trait):units,
>> data = drink.df, family =
>> "zipoisson",
>> prior = prior2)
>> )
>> summary(drk.zimcmc)
>>
>> ### timing
>> #
>> ### user system elapsed
>> ### 2105.366 544.881 2640.030
>>
>> --
>> Dave Atkins, PhD
>> Research Associate Professor
>> Department of Psychiatry and Behavioral Science
>> University of Washington
>> datkins at u.washington.edu
>>
>> Center for the Study of Health and Risk Behaviors (CSHRB)
>> 1100 NE 45th Street, Suite 300
>> Seattle, WA 98105
>> 206-616-3879
>> http://depts.washington.edu/cshrb/
>> (Mon-Wed)
>>
>> Center for Healthcare Improvement, for Addictions, Mental Illness,
>> Medically Vulnerable Populations (CHAMMP)
>> 325 9th Avenue, 2HH-15
>> Box 359911
>> Seattle, WA 98104?
>> 206-897-4210
>> http://www.chammp.org
>> (Thurs)
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
More information about the R-sig-mixed-models
mailing list