[R-sig-ME] Data Redux -- Re: Computational speed - MCMCglmm/lmer

Wed Jun 23 20:40:53 CEST 2010

First, my apologies for not providing clean data in my initial posting, 
which was clearly an oversight on my part.  I have an email in to the PI 
but haven't gotten a response yet.

I have attached an updated dataset, that pulls out the 2 suspect 
participants as well as those with gender coded 2.  In addition, it 
includes two more variables:

weekday - 7 level factor for day of week (new variable weekend now has 
the binary version)
fratsor - a binary factor indicating whether individual is part of a 
fraternity or sorority (or not)

I have an updated script below that shows (descriptively) how drinking 
varies over these factors.  Moreover, including weekday as either fixed 
or random, really ratchets up the time for glmer.  Again, my original 
purpose is definitely *not* to criticize lme4 (or Doug), but simply to 
get some feedback about computational speed, and whether or not it's 
possible to speed up runs via hardware, OS, or software.

Finally, in providing some data, I was trying to provide a brief subset 
of data that has certain realistic qualities (and it is real data) but 
is not a "full" data set.  In trying to be simple and straightforward, I 
may have raised more questions.

So, in response to a few questions raised:

The data come from a study of problematic drinking focused on high-risk 
events, and this particular data focused on 21st birthdays (here in the 
US, that is the legal drinking age).  The present data are from a survey 
(ie, non-intervention), which then informed a later intervention study. 
  Thus, the (up to) 90 days of retrospective drinking, will cover the 
individual's 21st birthday.

Doug asks about someone reporting 45 drinks.  Because we are capturing 
21st birthdays, we are definitely getting some extreme drinking, but we 
also are relying on self-report.  Clearly, there will be some errors and 
possible bravado about feats of drinking.  Moreover, the study has a 
protocol for addressing potentially dangerous levels of drinking with 
participants (when blood-alcohol content passes a certain cutoff).

Hadley asks about the repeated data.  Part of the study involves friends 
of the main participant (as one intervention condition includes friends 
input as part of the intervention).  Some of the friend's data were 
included in a larger version of this dataset, which then led to 
replicates of the participant's data.  Thus, we knew individuals with 
friend's data had replicates, but this version was supposed to have that 
cleaned up.

Finally, this particular data was collected on a rolling basis, and thus 
the 90 day recounting of drinking covers a large swath of the calendar 
year.  We are currently working on a paper that examines student 
drinking on typical weekdays vs. weekends vs. 21st birthday vs. holidays 
(including St. Patrick's Day, though July 4th and New Year's Eve are 
much heavier drinking days...).

Hopefully this clarifies certain aspects of the data, and I definitely 
appreciate folks input on the computational aspects -- I'm certainly 
learning.

[BTW, I'm currently fitting a second model to the one reported below, 
with a random effect for weekday -- currently at 30 minutes and counting.]

cheers, Dave

### updated 6.23.10 -- thinking through computational speed with lmer 
and MCMCglmm
#
### read in *updated* drinking data
drink.df <- read.table("drinksUPDATED.txt",
						header = TRUE, sep = "\t")
str(drink.df) # 57K rows

### NOTE: slight variable changes since last iteration
### id is id variable (shocking)
### gender is binary factor
### fratsor is binary factor (in fraternity or sorority vs. not)
### weekday now has day of week
### weekend is now weekday (Sun - Thurs) vs. weekend (Fri - Sat)
### drinks has number of drinks consumed

### trim trailing white spaces
levels(drink.df$weekday) <- sub('[[:space:]]+$', '', 
levels(drink.df$weekday))

### re-order factor
library(gdata)
drink.df$weekday <- reorder.factor(drink.df$weekday, new.order = 
c(4,2,6,7,5,1,3))
levels(drink.df$weekday)

### distribution of outcome
table(drink.df$drinks)
plot(table(drink.df$drinks), lwd=2) # zero-inflated

### how many people?
length(unique(drink.df$id)) # 980
sort(table(drink.df$id)) # max of 90 drinks

### how do drinks vary across days of week and also gender and fratsor?
#
with(drink.df, tapply(drinks, list(gender, fratsor), mean))
### not surprising, but men in fraternities drink notably more

with(drink.df, tapply(drinks, list(gender, weekday, fratsor), mean))
### general peak on friday and/or saturday, but differences in rest
### of week for fraternity/sorority crowd...

### speed tests
#
### fit random intercept and random slope for weekday
### fixed effects for gender and weekday and interaction
#
### Poisson GLMM with glmer()
library(lme4)

### NOTE: weekday is different from last set of code; last time it was
###			binary, this time it is 7 level factor
#
### random intercept
system.time(
drk.glmer <- glmer(drinks ~ weekday*gender + (1 | id),
					data = drink.df, family = poisson,
					verbose = TRUE)
)
summary(drk.glmer)

### timing
#
###    user  system elapsed
### 257.268  39.115 295.754

Hadley Wickham wrote:
>> Forgive me for sounding grouchy but I find this whole discussion
>> misguided.  Worrying about the speed of fitting a model and niceties
>> of the model formulation before doing elementary checks on the data is
>> putting the cart before the horse.  Why is gender coded as 0, 1 and 2?
>>  Why, when there was a maximum of 90 days of monitoring, is there one
>> id with 435 observations and another with 180 observations.
> 
> Especially when you look at (e.g.) subset(drink.df, id == 8396125) and
> notice that drinks appear to be repeated in blocks of five.  Looks
> like a data processing error - but has it also happened for other
> subject?
> 
> I also wonder why the date has been removed from the data. It seems
> like this would be an important covariate (St Patrick's day, match
> days, birthdays, ...).  And I hope weekday isn't Monday-Friday...
> 
> Hadley
> 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: drinksUPDATED.txt
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20100623/f0ef694b/attachment.txt>