[R-sig-ME] over-dispersed Poisson and glmer / MCMCglmm

Fri Jan 7 18:56:39 CET 2011

Hi all--

Happy new year.  In some ways this is a bit off-topic, but I would be 
interested in any thoughts that folks might have.

Background: Examining covariates of university student's daily drinking 
(gender, weekend vs. weekday, fraternity/sorority member or not).  This 
is a revised version of a dataset that I mentioned a while back on the 
listserv (related to computational speed).  These daily reports 
encompass student's 21st birthdays, so we have some fairly extreme 
drinking (and certainly some inaccuracies of recall; eg, someone 
reported 45 drinks, which would kill all but the most hardened 
alcoholics... though I never cease to be amazed at how much and how 
students drink).

The distribution of the outcome is quite extreme:

     0     1     2     3     4     5     6     7     8     9
18317  1015   926   690   599   568   401   227   319    97

There are 23,992 observations in total, so we have many, many zeroes but 
also over 5,000 non-zeroes.

Obviously, even at this stage we might think of a zero-inflated/altered 
model, but I examined an over-dispersed Poisson GLMM as some folks have 
argued that highly skewed, low mean data can be well-modeled by negative 
binomial regression.

Data and code are below to fit an over-dispersed Poisson GLMM in both 
glmer and MCMCglmm.  Here are some observations:

1. The two approaches lead to substantively different estimates of fixed 
and random effects.

2. Estimates of the marginal distribution of counts from both models are 
surprisingly good.

3. *But*, results from both models show wickedly non-normal distribution 
for the per-observation (over-dispersion) random-effects.

Fiddling with priors and number of iterations does not seem to change 
results for MCMCglmm (at least what I tried).

The problem in #3 goes away if we fit a zero-altered mixed model using 
MCMCglmm.

Thus, this is why I started by saying this is a bit off-topic.  I 
originally thought that perhaps one or the other of glmer and MCMCglmm 
might be having troubles in some way.  Now, I feel pretty confident that 
*both* are.  Thus, pragmatically, we can just move ahead with a 
zero-altered model.

I kick it out here to see if folks have any other input.  Intuitively, 
given the distribution of the outcome, I could understand why an 
over-dispersed Poisson GLMM might have difficulty.  However, I would be 
particularly interested in whether there is something more specific to 
be said.

Finally, per usual, I want to give a shout-out (American slang for 
accolade...) to Jarrod who helped me with some of the diagnosing of what 
was going on.  In particular, see code below from Jarrod for estimating 
units and simulating from MCMCglmm.

Sorry for the wordy email.

cheers, Dave

 > sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] MASS_7.3-9         MCMCglmm_2.10      corpcor_1.5.7      ape_2.6-2 
         coda_0.14-2
[6] tensorA_0.36       lme4_0.999375-37   Matrix_0.999375-47 
lattice_0.19-17

loaded via a namespace (and not attached):
[1] gee_4.13-16   grid_2.12.1   nlme_3.1-97   stats4_2.12.1

glmer fit:

 > print(dks.glmer.3, cor = F)
Generalized linear mixed model fit by the Laplace approximation
Formula: drinks ~ (weekend + gender + fratsor)^2 + (weekend - 1 | id) + 
      (1 | over)
    Data: tlfb.df
    AIC   BIC logLik deviance
  35961 36050 -17970    35939
Random effects:
  Groups Name           Variance Std.Dev. Corr
  over   (Intercept)    12.79661 3.57723
  id     weekendWeekday  0.14557 0.38154
         weekendWeekend  0.80957 0.89976  -0.651
Number of obs: 23992, groups: over, 23992; id, 980

Fixed effects:
                                 Estimate Std. Error z value Pr(>|z|)
(Intercept)                   -4.3684656  0.0805109  -54.26  < 2e-16 ***
weekendWeekend                 1.3314914  0.1254440   10.61  < 2e-16 ***
genderMen                      0.3592023  0.1153260    3.11  0.00184 **
fratsorFratSor                 0.3345262  0.1704030    1.96  0.04963 *
weekendWeekend:genderMen       0.4945154  0.1740287    2.84  0.00449 **
weekendWeekend:fratsorFratSor -0.0008532  0.2062541    0.00  0.99670
genderMen:fratsorFratSor       0.5618560  0.1884112    2.98  0.00286 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

MCMCglmm fit:

 > summary(dks.glmm3.1)

  Iterations = 3001:12991
  Thinning interval  = 10
  Sample size  = 1000

  DIC: 38559.60

  G-structure:  ~us(weekend):id

                    post.mean l-95% CI u-95% CI eff.samp
Weekday:Weekday.id     1.483   1.2583    1.721    334.7
Weekend:Weekday.id     1.130   0.9338    1.326    498.9
Weekday:Weekend.id     1.130   0.9338    1.326    498.9
Weekend:Weekend.id     1.677   1.3889    1.944    486.3

  R-structure:  ~units

       post.mean l-95% CI u-95% CI eff.samp
units      5.21    4.966    5.463    140.8

  Location effects: drinks ~ (weekend + gender + fratsor)^2

                               post.mean l-95% CI u-95% CI eff.samp 
pMCMC
(Intercept)                    -3.28303 -3.45352 -3.13054    353.7 
<0.001 ***
weekendWeekend                  1.55400  1.37543  1.70599    667.5 
<0.001 ***
genderMen                       0.43253  0.22656  0.65524    848.2 
<0.001 ***
fratsorFratSor                  0.57028  0.24066  0.93947   1000.0 
<0.001 ***
weekendWeekend:genderMen        0.20535 -0.02118  0.43877    707.2 
0.090 .
weekendWeekend:fratsorFratSor  -0.27792 -0.55778 -0.01192    846.4 
0.054 .
genderMen:fratsorFratSor        0.51294  0.09726  0.96072   1000.0 
0.016 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R code for setting up and fitting models:

tlfb.df <- 
read.csv("http://depts.washington.edu/cshrb/newweb/stats%20documents/TLFB.Final.csv", 

					header = TRUE)

str(tlfb.df)

### set categorical variables as factors
tlfb.df <- within(tlfb.df, {
     gender <- factor(gender, 0:1, c("Women","Men"))
     weekend <- factor(weekend, 0:1, c("Weekday","Weekend"))
     fratsor <- factor(fratsor, 0:1, c("Not FratSor","FratSor"))
     })

### Number of participants
length(unique(tlfb.df$id)) # 980

### how much data per person?
table(table(tlfb.df$id)) # 719 have all 30 days

### distribution of outcome
plot(table(tlfb.df$drinks), lwd=3, ylab = "Frequency",
         xlab = "Number of Drinks",
         main= "TLFB Data")

### examine means by covariates
with(tlfb.df, tapply(drinks, list(weekend, fratsor, gender),
         mean, na.rm = TRUE))

### GLMM – over-dispersed Poisson
library(lme4)

### generate variable to capture over-dispersion
tlfb.df$over <- 1:nrow(tlfb.df)

### check contrasts
contrasts(tlfb.df$weekend)
contrasts(tlfb.df$gender)
contrasts(tlfb.df$fratsor)

### all 2-way interactions in fixed-effects
### random intercept and weekend (parameterized to be similar to
### MCMCglmm), plus per-observation random-effect
dks.glmer.3 <- glmer(drinks ~ (weekend + gender + fratsor)^2 +
                     (weekend -1 | id) + (1 | over),
                     data = tlfb.df, verbose = TRUE,
                     family = poisson)
# Comment from Jarrod:
# Using (weekend-1 | id) so that the covariance matrix
# refers to not-weekned and weekned rather than not-weeknd
# and the contrast between weekend and not-weekend.
# Alternativley you could put us(weekend+1 |):id in
# MCMCglmm which is equivalent to (weekend | id) in lmer.

print(dks.glmer.3, cor = F)

library(MCMCglmm)

prior2 <- list(R=list(V=1, nu=0.002),
					G=list(G1=list(V=diag(2), nu=1.002)))
### NOTE: results largely the same over default, 65K and 500K iterations

dks.glmm3.1 <- MCMCglmm(drinks ~ (weekend + gender + fratsor)^2,
	data = tlfb.df, family = "poisson",
	random = ~ us(weekend):id,
	prior = prior2,
	verbose = TRUE, pr = TRUE, pl = TRUE)
summary(dks.glmm3.1)

### traceplots of variance terms – at default iterations not bad
### though as noted above, substantive results don’t change at much
### higher iterations
plot(dks.glmm3.1$VCV)

### how many zeroes (and counts generally) are each predicting?

glmer.fit <- fitted(dks.glmer.3)
mcmc.fit <- exp(colMeans(dks.glmm3.1$Liab))

### get predicted counts up to 20 drinks
pred.tlfb <- matrix(NA, ncol=2, nrow=20)
for (i in 0:19) {
	pred.tlfb[i+1, 1] <- sum(dpois(i, glmer.fit))
	pred.tlfb[i+1, 2] <- sum(dpois(i, mcmc.fit))
	}

cbind(Obs = table(tlfb.df$drinks)[1:20],
				glmer = pred.tlfb[,1],
				mcmc = pred.tlfb[,2])
### NOTE: somewhat surprising how well these models predict
### the marginal distribution of counts, but…
#
### Examine distribution of random-effects for both models

ranef.glmer <- ranef(dks.glmer.3)
str(ranef.glmer)

par(mfrow=c(3,2))
hist(ranef.glmer[[1]][[1]], col = grey(.6))
qqnorm(ranef.glmer[[1]][[1]]) ; qqline(ranef.glmer[[1]][[1]])
hist(ranef.glmer[[2]][[1]], col = grey(.6))
qqnorm(ranef.glmer[[2]][[1]]) ; qqline(ranef.glmer[[2]][[1]])
hist(ranef.glmer[[2]][[2]], col = grey(.6))
qqnorm(ranef.glmer[[2]][[2]]) ; qqline(ranef.glmer[[2]][[2]])

### NOTE: over-dispersion random-effect has complete
### disjoint area and bimodal

### Random-effects in MCMCglmm
#
### both fixed and random are in Sol, fixed first
head(dks.glmm3.1$Sol[,1:10])
dim(dks.glmm3.1$Sol)

### NOTE: first 7 columns are fixed-effects, then next
### 980 are weekday estimates, then following 980 are
### weekend estimates
ranef.mcmc1 <- colMeans(dks.glmm3.1$Sol[,8:987])
ranef.mcmc2 <- colMeans(dks.glmm3.1$Sol[,988:1967])

par(mfrow=c(2,2))
hist(ranef.mcmc1)
qqnorm(ranef.mcmc1) ; qqline(ranef.mcmc1)
hist(ranef.mcmc2)
qqnorm(ranef.mcmc2) ; qqline(ranef.mcmc2)

### Look pretty reasonable
#
### Per Jarrod, observation level random-effects not
### stored, but "the latent variable minus the
### predictions on the link scale (i.e. type="terms")
### should be equivalent to the observation-level
### random effects in glmer."

ranef.mcmc3 <- colMeans(dks.glmm3.1$Liab) - predict(dks.glmm3.1, 
marginal = NULL, type = "terms")

hist(ranef.mcmc3)
qqnorm(ranef.mcmc3) ; qqline(ranef.mcmc3)

### NOTE: Much like glmer, the distribution of “units” is totally
### non-normal

### From Jarrod, comparing that models are similar between
### glmer and MCMCglmm

nind <- length(unique(tlfb.df$id))
# number of individuals

nobs <- dim(tlfb.df)[1]
# number of observations

any((dks.glmer.3 at X-dks.glmm3.1$X)@x!=0)
# fixed effect design matrix is the same

any((t(dks.glmer.3 at Zt)[,1:nobs]-Diagonal(nobs))@x!=0)
# first nobs columns are the observation-level random effects

any((t(dks.glmer.3 at Zt)[,nobs+1:(nind*2)]-dks.glmm3.1$Z)@x!=0)
# random effect design matrix is the same

# so the models are the same.

# Generate some new data according to the posterior mean parameter 
estimates from MCMCglmm:

library(MASS)

nind <- dim(dks.glmm3.1$Z)[2]/2
# number of individuals

W <- cBind(dks.glmm3.1$X, dks.glmm3.1$Z)
# design matrix

beta <- colMeans(dks.glmm3.1$Sol[,1:7])
# fixed effects

V.id <- matrix(colMeans(dks.glmm3.1$VCV[,1:4]),2,2)
# covariance matrix of id random effects

V.obs <- mean(dks.glmm3.1$VCV[,5])

u <- c(t(mvrnorm(dim(dks.glmm3.1$Z)[2]/2, c(0,0), V.id)))
#simulate some new random effects

theta <- c(beta,u)

tlfb.df$sim_drinks <- rpois(dim(W)[1],
									exp((W%*%theta)@x + rnorm(dim(W)[1], 0, sqrt(V.obs))))

### compare simulated to observed drinks
table(tlfb.df$drinks)
table(tlfb.df$sim_drinks)

### NOTE: huge tail in simulated data that extends far beyond observed
### max of 45

-- 
Dave Atkins, PhD
Research Associate Professor
Department of Psychiatry and Behavioral Science
University of Washington
datkins at u.washington.edu

Center for the Study of Health and Risk Behaviors (CSHRB)		
1100 NE 45th Street, Suite 300 	
Seattle, WA  98105 	
206-616-3879 	
http://depts.washington.edu/cshrb/
(Mon-Wed)	

Center for Healthcare Improvement, for Addictions, Mental Illness,
   Medically Vulnerable Populations (CHAMMP)
325 9th Avenue, 2HH-15
Box 359911
Seattle, WA 98104
http://www.chammp.org
(Thurs)