[R-sig-ME] Data Redux -- Re: Computational speed - MCMCglmm/lmer

Fri Jun 25 18:24:45 CEST 2010

Douglas Bates wrote:
> Thanks for sending the cleaned up version of the data, Dave.
> 
> First, it seems to me that the model should include the fratsor
> variable and perhaps an interaction with gender.  This will make it
> difficult to fit (because it has even more fixed-effects parameters)
> and I would suggest replacing weekday by a factor with three levels
> "Sun-Wed", "Thu" and "Fri-Sat".  See the enclosed plot for why I would
> suggest that.  Alternatively you could consider a binary variable of
> Sun-Wed and Thu-Sat.
> There definitely does seem to be a peak at Friday, especially in the
> Frat/Sor group and a drop-off to Saturday.  Interestingly Thursday
> seems to have heavier consumption by the Frat/Sor group than Saturday
> (I was wondering why attendance seems a little thin for my 8:00 a.m.
> Intro Engineering Statistics class on Fridays).

Agreed; one of the tricky things we regularly wrestle with (with such 
data) is how best to model the week effects.  A colleague and I have 
taken a look at possibly using cyclical terms (inspired, in part, by 
models in P&B).  There clearly is a *cycle* over weeks, it just ain't 
terribly smooth in its rising and falling.

[If anybody has any additional, creative thoughts on this one... I'm all 
ears.  There are a lot of daily alcohol studies, and this is a perennial 
thorn in the side during the analyses...]

The most common thing to find in the alcohol literature is either a 
dummy-code for weekend/weekday (where Thurs "switch hits" depending on 
the paper), or what you suggest: Sun-Wed, Thurs, Fri-Sat.  Letting day 
of week be a fully dummy-coded factor often leads to better fit based on 
model criteria (though, not universally with BIC for obvious reasons), 
but the tricky part is that we almost always want interactions with 
these changes, and interactions with 6 dummy-codes for day of week just 
"blossoms" the model incredibly.

Thanks for the substantive as well as computational thoughts; a few more 
comments from me below.

> Second, there is a good news/bad news situation regarding the
> development version of glmer.  I have introduced a coarser method of
> getting estimates selected with the optional argument nAGQ = 0.  It is
> more complicated to explain than I would like to embark on here but
> suffice to say it is much faster and gets good estimates but not
> the best.  At this point the output will look suspicious because I needed to
> turn off the system.time in the batch run, probably because of a
> memory protection issue that I will need to resolve.

This would actually be pretty attractive to me, as I don't mind 
"cooking" a final model for quite a while either using (g)lmer or 
MCMCglmm, but long running models can be trying when you're doing 
initial runs working toward a final model.

Per usual, many thanks for your expertise, code, and willingness to 
contribute.

cheers, Dave

> 
> Timing this fit works interactively for me but not in a batch run
> 
>> system.time(
> + drk0.glmer <- glmer(drinks ~ weekday*gender + (1 | id),
> +                                        data = drink.df, family = poisson,
> +                                        verbose = TRUE, nAGQ=0L)
> + )
> npt = 3 , n =  1
> rhobeg =  0.2 , rhoend =  2e-07
>   0.020:   3:      146545.; 1.00000
>  0.0020:   3:      146545.; 1.00000
>  0.00020:  10:      146543.;0.960752
>  2.0e-05:  12:      146543.;0.960984
>  2.0e-06:  15:      146543.;0.960974
>  2.0e-07:  15:      146543.;0.960974
> At return
>  17:     146542.78: 0.960974
>   user  system elapsed
>  16.870   0.080  17.029
> 
> So 17 seconds elapsed time is pretty good.  The full Laplacian fit
> took about 300 seconds and gave only a slightly better deviance.
> Regrettably I have seen situations where the full Laplacian fit is
> considerably better so I can't recommend avoiding the full Laplacian
> fit entirely.  I do think that the fit with nAGQ = 0 is okay for coarse model
> building where you need rough comparisons of several different models.
> 
> 
> On Wed, Jun 23, 2010 at 1:40 PM, David Atkins <datkins at u.washington.edu> wrote:
>> First, my apologies for not providing clean data in my initial posting,
>> which was clearly an oversight on my part.  I have an email in to the PI but
>> haven't gotten a response yet.
>>
>> I have attached an updated dataset, that pulls out the 2 suspect
>> participants as well as those with gender coded 2.  In addition, it includes
>> two more variables:
>>
>> weekday - 7 level factor for day of week (new variable weekend now has the
>> binary version)
>> fratsor - a binary factor indicating whether individual is part of a
>> fraternity or sorority (or not)
>>
>> I have an updated script below that shows (descriptively) how drinking
>> varies over these factors.  Moreover, including weekday as either fixed or
>> random, really ratchets up the time for glmer.  Again, my original purpose
>> is definitely *not* to criticize lme4 (or Doug), but simply to get some
>> feedback about computational speed, and whether or not it's possible to
>> speed up runs via hardware, OS, or software.
>>
>> Finally, in providing some data, I was trying to provide a brief subset of
>> data that has certain realistic qualities (and it is real data) but is not a
>> "full" data set.  In trying to be simple and straightforward, I may have
>> raised more questions.
>>
>> So, in response to a few questions raised:
>>
>> The data come from a study of problematic drinking focused on high-risk
>> events, and this particular data focused on 21st birthdays (here in the US,
>> that is the legal drinking age).  The present data are from a survey (ie,
>> non-intervention), which then informed a later intervention study.  Thus,
>> the (up to) 90 days of retrospective drinking, will cover the individual's
>> 21st birthday.
>>
>> Doug asks about someone reporting 45 drinks.  Because we are capturing 21st
>> birthdays, we are definitely getting some extreme drinking, but we also are
>> relying on self-report.  Clearly, there will be some errors and possible
>> bravado about feats of drinking.  Moreover, the study has a protocol for
>> addressing potentially dangerous levels of drinking with participants (when
>> blood-alcohol content passes a certain cutoff).
>>
>> Hadley asks about the repeated data.  Part of the study involves friends of
>> the main participant (as one intervention condition includes friends input
>> as part of the intervention).  Some of the friend's data were included in a
>> larger version of this dataset, which then led to replicates of the
>> participant's data.  Thus, we knew individuals with friend's data had
>> replicates, but this version was supposed to have that cleaned up.
>>
>> Finally, this particular data was collected on a rolling basis, and thus the
>> 90 day recounting of drinking covers a large swath of the calendar year.  We
>> are currently working on a paper that examines student drinking on typical
>> weekdays vs. weekends vs. 21st birthday vs. holidays (including St.
>> Patrick's Day, though July 4th and New Year's Eve are much heavier drinking
>> days...).
>>
>> Hopefully this clarifies certain aspects of the data, and I definitely
>> appreciate folks input on the computational aspects -- I'm certainly
>> learning.
>>
>> [BTW, I'm currently fitting a second model to the one reported below, with a
>> random effect for weekday -- currently at 30 minutes and counting.]