# [R-sig-ME] over-dispersed Poisson and glmer / MCMCglmm

David Atkins datkins at u.washington.edu
Fri Jan 7 18:56:39 CET 2011

```Hi all--

Happy new year.  In some ways this is a bit off-topic, but I would be
interested in any thoughts that folks might have.

Background: Examining covariates of university student's daily drinking
(gender, weekend vs. weekday, fraternity/sorority member or not).  This
is a revised version of a dataset that I mentioned a while back on the
listserv (related to computational speed).  These daily reports
encompass student's 21st birthdays, so we have some fairly extreme
drinking (and certainly some inaccuracies of recall; eg, someone
reported 45 drinks, which would kill all but the most hardened
alcoholics... though I never cease to be amazed at how much and how
students drink).

The distribution of the outcome is quite extreme:

0     1     2     3     4     5     6     7     8     9
18317  1015   926   690   599   568   401   227   319    97

There are 23,992 observations in total, so we have many, many zeroes but
also over 5,000 non-zeroes.

Obviously, even at this stage we might think of a zero-inflated/altered
model, but I examined an over-dispersed Poisson GLMM as some folks have
argued that highly skewed, low mean data can be well-modeled by negative
binomial regression.

Data and code are below to fit an over-dispersed Poisson GLMM in both
glmer and MCMCglmm.  Here are some observations:

1. The two approaches lead to substantively different estimates of fixed
and random effects.

2. Estimates of the marginal distribution of counts from both models are
surprisingly good.

3. *But*, results from both models show wickedly non-normal distribution
for the per-observation (over-dispersion) random-effects.

Fiddling with priors and number of iterations does not seem to change
results for MCMCglmm (at least what I tried).

The problem in #3 goes away if we fit a zero-altered mixed model using
MCMCglmm.

Thus, this is why I started by saying this is a bit off-topic.  I
originally thought that perhaps one or the other of glmer and MCMCglmm
might be having troubles in some way.  Now, I feel pretty confident that
*both* are.  Thus, pragmatically, we can just move ahead with a
zero-altered model.

I kick it out here to see if folks have any other input.  Intuitively,
given the distribution of the outcome, I could understand why an
over-dispersed Poisson GLMM might have difficulty.  However, I would be
particularly interested in whether there is something more specific to
be said.

Finally, per usual, I want to give a shout-out (American slang for
accolade...) to Jarrod who helped me with some of the diagnosing of what
was going on.  In particular, see code below from Jarrod for estimating
units and simulating from MCMCglmm.

Sorry for the wordy email.

cheers, Dave

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] MASS_7.3-9         MCMCglmm_2.10      corpcor_1.5.7      ape_2.6-2
coda_0.14-2
[6] tensorA_0.36       lme4_0.999375-37   Matrix_0.999375-47
lattice_0.19-17

loaded via a namespace (and not attached):
[1] gee_4.13-16   grid_2.12.1   nlme_3.1-97   stats4_2.12.1

glmer fit:

> print(dks.glmer.3, cor = F)
Generalized linear mixed model fit by the Laplace approximation
Formula: drinks ~ (weekend + gender + fratsor)^2 + (weekend - 1 | id) +
(1 | over)
Data: tlfb.df
AIC   BIC logLik deviance
35961 36050 -17970    35939
Random effects:
Groups Name           Variance Std.Dev. Corr
over   (Intercept)    12.79661 3.57723
id     weekendWeekday  0.14557 0.38154
weekendWeekend  0.80957 0.89976  -0.651
Number of obs: 23992, groups: over, 23992; id, 980

Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                   -4.3684656  0.0805109  -54.26  < 2e-16 ***
weekendWeekend                 1.3314914  0.1254440   10.61  < 2e-16 ***
genderMen                      0.3592023  0.1153260    3.11  0.00184 **
fratsorFratSor                 0.3345262  0.1704030    1.96  0.04963 *
weekendWeekend:genderMen       0.4945154  0.1740287    2.84  0.00449 **
weekendWeekend:fratsorFratSor -0.0008532  0.2062541    0.00  0.99670
genderMen:fratsorFratSor       0.5618560  0.1884112    2.98  0.00286 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

MCMCglmm fit:

> summary(dks.glmm3.1)

Iterations = 3001:12991
Thinning interval  = 10
Sample size  = 1000

DIC: 38559.60

G-structure:  ~us(weekend):id

post.mean l-95% CI u-95% CI eff.samp
Weekday:Weekday.id     1.483   1.2583    1.721    334.7
Weekend:Weekday.id     1.130   0.9338    1.326    498.9
Weekday:Weekend.id     1.130   0.9338    1.326    498.9
Weekend:Weekend.id     1.677   1.3889    1.944    486.3

R-structure:  ~units

post.mean l-95% CI u-95% CI eff.samp
units      5.21    4.966    5.463    140.8

Location effects: drinks ~ (weekend + gender + fratsor)^2

post.mean l-95% CI u-95% CI eff.samp
pMCMC
(Intercept)                    -3.28303 -3.45352 -3.13054    353.7
<0.001 ***
weekendWeekend                  1.55400  1.37543  1.70599    667.5
<0.001 ***
genderMen                       0.43253  0.22656  0.65524    848.2
<0.001 ***
fratsorFratSor                  0.57028  0.24066  0.93947   1000.0
<0.001 ***
weekendWeekend:genderMen        0.20535 -0.02118  0.43877    707.2
0.090 .
weekendWeekend:fratsorFratSor  -0.27792 -0.55778 -0.01192    846.4
0.054 .
genderMen:fratsorFratSor        0.51294  0.09726  0.96072   1000.0
0.016 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R code for setting up and fitting models:

tlfb.df <-

str(tlfb.df)

### set categorical variables as factors
tlfb.df <- within(tlfb.df, {
gender <- factor(gender, 0:1, c("Women","Men"))
weekend <- factor(weekend, 0:1, c("Weekday","Weekend"))
fratsor <- factor(fratsor, 0:1, c("Not FratSor","FratSor"))
})

### Number of participants
length(unique(tlfb.df\$id)) # 980

### how much data per person?
table(table(tlfb.df\$id)) # 719 have all 30 days

### distribution of outcome
plot(table(tlfb.df\$drinks), lwd=3, ylab = "Frequency",
xlab = "Number of Drinks",
main= "TLFB Data")

### examine means by covariates
with(tlfb.df, tapply(drinks, list(weekend, fratsor, gender),
mean, na.rm = TRUE))

### GLMM – over-dispersed Poisson
library(lme4)

### generate variable to capture over-dispersion
tlfb.df\$over <- 1:nrow(tlfb.df)

### check contrasts
contrasts(tlfb.df\$weekend)
contrasts(tlfb.df\$gender)
contrasts(tlfb.df\$fratsor)

### all 2-way interactions in fixed-effects
### random intercept and weekend (parameterized to be similar to
### MCMCglmm), plus per-observation random-effect
dks.glmer.3 <- glmer(drinks ~ (weekend + gender + fratsor)^2 +
(weekend -1 | id) + (1 | over),
data = tlfb.df, verbose = TRUE,
family = poisson)
# Comment from Jarrod:
# Using (weekend-1 | id) so that the covariance matrix
# refers to not-weekned and weekned rather than not-weeknd
# and the contrast between weekend and not-weekend.
# Alternativley you could put us(weekend+1 |):id in
# MCMCglmm which is equivalent to (weekend | id) in lmer.

print(dks.glmer.3, cor = F)

library(MCMCglmm)

prior2 <- list(R=list(V=1, nu=0.002),
G=list(G1=list(V=diag(2), nu=1.002)))
### NOTE: results largely the same over default, 65K and 500K iterations

dks.glmm3.1 <- MCMCglmm(drinks ~ (weekend + gender + fratsor)^2,
data = tlfb.df, family = "poisson",
random = ~ us(weekend):id,
prior = prior2,
verbose = TRUE, pr = TRUE, pl = TRUE)
summary(dks.glmm3.1)

### traceplots of variance terms – at default iterations not bad
### though as noted above, substantive results don’t change at much
### higher iterations
plot(dks.glmm3.1\$VCV)

### how many zeroes (and counts generally) are each predicting?

glmer.fit <- fitted(dks.glmer.3)
mcmc.fit <- exp(colMeans(dks.glmm3.1\$Liab))

### get predicted counts up to 20 drinks
pred.tlfb <- matrix(NA, ncol=2, nrow=20)
for (i in 0:19) {
pred.tlfb[i+1, 1] <- sum(dpois(i, glmer.fit))
pred.tlfb[i+1, 2] <- sum(dpois(i, mcmc.fit))
}

cbind(Obs = table(tlfb.df\$drinks)[1:20],
glmer = pred.tlfb[,1],
mcmc = pred.tlfb[,2])
### NOTE: somewhat surprising how well these models predict
### the marginal distribution of counts, but…
#
### Examine distribution of random-effects for both models

ranef.glmer <- ranef(dks.glmer.3)
str(ranef.glmer)

par(mfrow=c(3,2))
hist(ranef.glmer[[1]][[1]], col = grey(.6))
qqnorm(ranef.glmer[[1]][[1]]) ; qqline(ranef.glmer[[1]][[1]])
hist(ranef.glmer[[2]][[1]], col = grey(.6))
qqnorm(ranef.glmer[[2]][[1]]) ; qqline(ranef.glmer[[2]][[1]])
hist(ranef.glmer[[2]][[2]], col = grey(.6))
qqnorm(ranef.glmer[[2]][[2]]) ; qqline(ranef.glmer[[2]][[2]])

### NOTE: over-dispersion random-effect has complete
### disjoint area and bimodal

### Random-effects in MCMCglmm
#
### both fixed and random are in Sol, fixed first
dim(dks.glmm3.1\$Sol)

### NOTE: first 7 columns are fixed-effects, then next
### 980 are weekday estimates, then following 980 are
### weekend estimates
ranef.mcmc1 <- colMeans(dks.glmm3.1\$Sol[,8:987])
ranef.mcmc2 <- colMeans(dks.glmm3.1\$Sol[,988:1967])

par(mfrow=c(2,2))
hist(ranef.mcmc1)
qqnorm(ranef.mcmc1) ; qqline(ranef.mcmc1)
hist(ranef.mcmc2)
qqnorm(ranef.mcmc2) ; qqline(ranef.mcmc2)

### Look pretty reasonable
#
### Per Jarrod, observation level random-effects not
### stored, but "the latent variable minus the
### predictions on the link scale (i.e. type="terms")
### should be equivalent to the observation-level
### random effects in glmer."

ranef.mcmc3 <- colMeans(dks.glmm3.1\$Liab) - predict(dks.glmm3.1,
marginal = NULL, type = "terms")

hist(ranef.mcmc3)
qqnorm(ranef.mcmc3) ; qqline(ranef.mcmc3)

### NOTE: Much like glmer, the distribution of “units” is totally
### non-normal

### From Jarrod, comparing that models are similar between
### glmer and MCMCglmm

nind <- length(unique(tlfb.df\$id))
# number of individuals

nobs <- dim(tlfb.df)[1]
# number of observations

any((dks.glmer.3 at X-dks.glmm3.1\$X)@x!=0)
# fixed effect design matrix is the same

any((t(dks.glmer.3 at Zt)[,1:nobs]-Diagonal(nobs))@x!=0)
# first nobs columns are the observation-level random effects

any((t(dks.glmer.3 at Zt)[,nobs+1:(nind*2)]-dks.glmm3.1\$Z)@x!=0)
# random effect design matrix is the same

# so the models are the same.

# Generate some new data according to the posterior mean parameter
estimates from MCMCglmm:

library(MASS)

nind <- dim(dks.glmm3.1\$Z)[2]/2
# number of individuals

W <- cBind(dks.glmm3.1\$X, dks.glmm3.1\$Z)
# design matrix

beta <- colMeans(dks.glmm3.1\$Sol[,1:7])
# fixed effects

V.id <- matrix(colMeans(dks.glmm3.1\$VCV[,1:4]),2,2)
# covariance matrix of id random effects

V.obs <- mean(dks.glmm3.1\$VCV[,5])

u <- c(t(mvrnorm(dim(dks.glmm3.1\$Z)[2]/2, c(0,0), V.id)))
#simulate some new random effects

theta <- c(beta,u)

tlfb.df\$sim_drinks <- rpois(dim(W)[1],
exp((W%*%theta)@x + rnorm(dim(W)[1], 0, sqrt(V.obs))))

### compare simulated to observed drinks
table(tlfb.df\$drinks)
table(tlfb.df\$sim_drinks)

### NOTE: huge tail in simulated data that extends far beyond observed
### max of 45

--
Dave Atkins, PhD
Research Associate Professor
Department of Psychiatry and Behavioral Science
University of Washington
datkins at u.washington.edu

Center for the Study of Health and Risk Behaviors (CSHRB)
1100 NE 45th Street, Suite 300
Seattle, WA  98105
206-616-3879
http://depts.washington.edu/cshrb/
(Mon-Wed)

Center for Healthcare Improvement, for Addictions, Mental Illness,
Medically Vulnerable Populations (CHAMMP)
325 9th Avenue, 2HH-15
Box 359911
Seattle, WA 98104
http://www.chammp.org
(Thurs)

```