[R-sig-ME] Computational speed - MCMCglmm/lmer

Sun Jun 20 02:00:11 CEST 2010

Douglas Bates wrote:
> Thanks for making the data available, Dave.  I will be very interested
> in seeing if I can make glmer more effective in fitting this model.
> 
> Right now I have a quick question, what are genders 0, 1, 2?  I
> checked that if gender is consistent within subjects, which it is, so
> you have 549 subjects of gender 0, 439 of gender 1 and 2 of gender 2.
> If I wanted to label them as "M", "F" and "?" which labels should I
> apply to genders 0 and 1 and what does gender 2 mean?

Whoops, thought this data was a bit cleaner that apparently it is.

For gender, 0 = Female, 1 = Male

(Men drink more, so pretty easy to confirm...)

I would need to double-check on the 2s, though I do believe there is a 
category for transgendered on the demographics.

I'll try to get a cleaner dataset up soon.

D

> 
>> drink.df <- within(read.delim("http://depts.washington.edu/cshrb/newweb/stats%20documents/tlfb.txt"),
> +                    {
> +                        id      <- factor(id)
> +                        gender  <- factor(gender)
> +                        weekday <- factor(weekday, labels = c("N","Y"))
> +                    })
>> str(drink.df) # 57K rows
> 'data.frame':	57318 obs. of  4 variables:
>  $ id     : Factor w/ 990 levels "1014503","1017782",..: 1 1 1 2 2 2 2 2 2 2 ...
>  $ gender : Factor w/ 3 levels "0","1","2": 1 1 1 2 2 2 2 2 2 2 ...
>  $ weekday: Factor w/ 2 levels "N","Y": 2 1 1 2 1 1 2 2 2 2 ...
>  $ drinks : int  0 0 5 8 5 10 0 0 0 0 ...
>> summary(drink.df)
>        id        gender    weekday       drinks
>  8396125:  435   0:32445   N:16633   Min.   : 0.0000
>  6551092:  180   1:24780   Y:40685   1st Qu.: 0.0000
>  1028078:   90   2:   93             Median : 0.0000
>  1325095:   90                       Mean   : 0.9983
>  1365806:   90                       3rd Qu.: 0.0000
>  1411092:   90                       Max.   :45.0000
>  (Other):56343
>> str(subj <- unique(subset(drink.df, select = c(id, gender))))
> 'data.frame':	990 obs. of  2 variables:
>  $ id    : Factor w/ 990 levels "1014503","1017782",..: 1 2 3 4 5 6 7 8 9 10 ...
>  $ gender: Factor w/ 3 levels "0","1","2": 1 2 1 1 2 2 1 1 1 2 ...
>> summary(subj)
>        id      gender
>  1014503:  1   0:549
>  1017782:  1   1:439
>  1018311:  1   2:  2
>  1026066:  1
>  1028078:  1
>  1036230:  1
>  (Other):984
> 
> 
> On Sat, Jun 19, 2010 at 10:42 AM, David Atkins <datkins at u.washington.edu> wrote:
>> Hi all--
>>
>> I use (g)lmer and MCMCglmm on a weekly basis, and I am wondering about
>> options for speeding up their computations.  This is primarily an issue with
>> MCMCglmm, given the many necessary MCMC iterations to get to convergence on
>> some problems.  But, even with glmer(), I have runs that get into 20-30
>> minutes.
>>
>> First, let me be very clear that this is in no way a criticism of Doug's and
>> Jarrod's work (package developers for lme4 and MCMCglmm, respectively).
>>  Their code has probably brought models/data into range that would not have
>> been possible.
>>
>> Second, I have included link to data and script below, along with some
>> timings on my computer: Mac Book Pro, 2.5GHz, with 4GB RAM.  Here's
>> sessionInfo from my runs:
>>
>>> sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] MCMCglmm_2.05      corpcor_1.5.6      ape_2.5-3
>> [4] coda_0.13-5        tensorA_0.35       lme4_0.999375-34
>> [7] Matrix_0.999375-41 lattice_0.19-7
>>
>> loaded via a namespace (and not attached):
>> [1] gee_4.13-14   grid_2.11.1   nlme_3.1-96   stats4_2.11.1 tools_2.11.1
>>
>> Specific questions:
>>
>> 1. Would be curious to know timings on other people's set-ups.  Jarrod and I
>> had an exchange one time where he was gracious enough to run a zero-inflated
>> model where I was concerned about convergence.  He ran a model with 1.3M
>> iterations, which I think would take a number of days, if not a week on my
>> computer.  This was part of what got me thinking about this.  Thus, my first
>> interest is whether there is an "optimal" hardware/OS configuration, or does
>> it matter?
>>
>> Some things I see in R-help archives:
>>
>> 2. 32 vs. 64-bit: Seems like this is mostly an issue of data/model size and
>> whether you need to access more than 4GB of RAM.  AFAICS, 64-bit processors
>> are not necessarily faster.
>>
>> 3. "Optimized" BLAS: There's a bit of discussion about optimized BLAS (basis
>> linear algebra... something).  However, these discussions note that there is
>> no generally superior BLAS.  Not sure whether specific BLAS might be
>> optimized for GLMM computations.
>>
>> 4. Parallel computing: With multi-core computers, looks like there are some
>> avenues for splitting intensive computations across processors.
>>
>> Finally, I'm asking here b/c I run into these issues with GLMM (and
>> zero-inflated mixed models), though most of the discussion I've seen thus
>> far about computation speed has been on R-help.
>>
>> The data below are self-reported drinks (alcohol) from college students for
>> up to the last 90 days.  Distribution of counts is zero-inflated.  I run a
>> Poisson GLMM with glmer, over-dispersed Poisson GLMM with MCMCglmm, and then
>> zero-inflated OD Poisson with MCMCglmm and provide timings for my set-up.
>>
>> Okay, any and all thoughts welcomed.
>>
>> thanks, Dave
>>
>>
>> ### thinking through computational speed with lmer and MCMCglmm
>> #
>> ### read drinking data
>> drink.df <- read.table(file =
>> "http://depts.washington.edu/cshrb/newweb/stats%20documents/tlfb.txt",
>>                                                header = TRUE, sep = "\t")
>> str(drink.df) # 57K rows
>>
>> ### id is id variable (shocking)
>> ### gender and weekday are 0/1 indicator variables
>> ### drinks has number of drinks consumed
>>
>> ### distribution of outcome
>> table(drink.df$drinks)
>> plot(table(drink.df$drinks), lwd=2) # zero-inflated
>>
>> ### how many people?
>> length(unique(drink.df$id)) # 990
>> sort(table(drink.df$id))
>>
>> ### NOTE: most people have max of 90, which they should
>> ###                     two folks with 180 and 435 (prob data errors)
>> ###                     long negative tail down from 90
>> #
>> ### NOTE: for this purpose, not worrying about the 180/435
>>
>> ### speed tests
>> #
>> ### fit random intercept and random slope for weekday
>> ### fixed effects for gender and weekday and interaction
>> #
>> ### Poisson GLMM with glmer()
>> library(lme4)
>>
>> system.time(
>> drk.glmer <- glmer(drinks ~ weekday*gender + (weekday | id),
>>                                        data = drink.df, family = poisson,
>>                                        verbose = TRUE)
>> )
>> summary(drk.glmer)
>>
>> ### timing
>> #
>> ###    user  system elapsed
>> ###  36.326   9.013  45.316
>>
>> ### over-dispersed Poisson GLMM with MCMCglmm()
>> library(MCMCglmm)
>>
>> prior <- list(R = list(V = 1, n = 1),
>>                                G = list(G1 = list(V = diag(2), n = 2)))
>> system.time(
>> drk.mcmc <- MCMCglmm(drinks ~ weekday*gender,
>>                                        random = ~ us(1 + weekday):id,
>>                                        data = drink.df, family = "poisson",
>>                                        prior = prior)
>> )
>> summary(drk.mcmc) # NOTE: using summary.MCMCglmm in v 2.05 of package
>>
>> ### timing
>> #
>> ###        user   system  elapsed
>> ###     1034.317  165.128 1203.536
>>
>> ### zero-inflated, over-dispersed Poisson GLMM with MCMCglmm()
>> #
>> ### NOTE: haven't run the following yet, other than a quick "toy run" to
>> ###                     sure that it is set up correctly.
>> ### NOTE: this only has random intercept in each portion of the model
>>
>> prior2 <- list(R = list(V = diag(2), n = 1, fix = 2),
>>                                G = list(G1 = list(V = 1, n = 1),
>>                                                  G2 = list(V = 1, n = 1)))
>> system.time(
>> drk.zimcmc <- MCMCglmm(drinks ~ -1 + trait*weekday*gender,
>>                                        random = ~ idh(at.level(trait, 1)):id
>> + idh(at.level(trait, 2)):id,
>>                                        rcov = ~ idh(trait):units,
>>                                        data = drink.df, family =
>> "zipoisson",
>>                                        prior = prior2)
>> )
>> summary(drk.zimcmc)
>>
>> ### timing
>> #
>> ###      user   system  elapsed
>> ###    2105.366  544.881 2640.030
>>
>> --
>> Dave Atkins, PhD
>> Research Associate Professor
>> Department of Psychiatry and Behavioral Science
>> University of Washington
>> datkins at u.washington.edu
>>
>> Center for the Study of Health and Risk Behaviors (CSHRB)
>> 1100 NE 45th Street, Suite 300
>> Seattle, WA  98105
>> 206-616-3879
>> http://depts.washington.edu/cshrb/
>> (Mon-Wed)
>>
>> Center for Healthcare Improvement, for Addictions, Mental Illness,
>>  Medically Vulnerable Populations (CHAMMP)
>> 325 9th Avenue, 2HH-15
>> Box 359911
>> Seattle, WA 98104?
>> 206-897-4210
>> http://www.chammp.org
>> (Thurs)
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>