[R-sig-ME] over-dispersed Poisson and glmer / MCMCglmm
David Atkins
datkins at u.washington.edu
Fri Jan 7 18:56:39 CET 2011
Hi all--
Happy new year. In some ways this is a bit off-topic, but I would be
interested in any thoughts that folks might have.
Background: Examining covariates of university student's daily drinking
(gender, weekend vs. weekday, fraternity/sorority member or not). This
is a revised version of a dataset that I mentioned a while back on the
listserv (related to computational speed). These daily reports
encompass student's 21st birthdays, so we have some fairly extreme
drinking (and certainly some inaccuracies of recall; eg, someone
reported 45 drinks, which would kill all but the most hardened
alcoholics... though I never cease to be amazed at how much and how
students drink).
The distribution of the outcome is quite extreme:
0 1 2 3 4 5 6 7 8 9
18317 1015 926 690 599 568 401 227 319 97
There are 23,992 observations in total, so we have many, many zeroes but
also over 5,000 non-zeroes.
Obviously, even at this stage we might think of a zero-inflated/altered
model, but I examined an over-dispersed Poisson GLMM as some folks have
argued that highly skewed, low mean data can be well-modeled by negative
binomial regression.
Data and code are below to fit an over-dispersed Poisson GLMM in both
glmer and MCMCglmm. Here are some observations:
1. The two approaches lead to substantively different estimates of fixed
and random effects.
2. Estimates of the marginal distribution of counts from both models are
surprisingly good.
3. *But*, results from both models show wickedly non-normal distribution
for the per-observation (over-dispersion) random-effects.
Fiddling with priors and number of iterations does not seem to change
results for MCMCglmm (at least what I tried).
The problem in #3 goes away if we fit a zero-altered mixed model using
MCMCglmm.
Thus, this is why I started by saying this is a bit off-topic. I
originally thought that perhaps one or the other of glmer and MCMCglmm
might be having troubles in some way. Now, I feel pretty confident that
*both* are. Thus, pragmatically, we can just move ahead with a
zero-altered model.
I kick it out here to see if folks have any other input. Intuitively,
given the distribution of the outcome, I could understand why an
over-dispersed Poisson GLMM might have difficulty. However, I would be
particularly interested in whether there is something more specific to
be said.
Finally, per usual, I want to give a shout-out (American slang for
accolade...) to Jarrod who helped me with some of the diagnosing of what
was going on. In particular, see code below from Jarrod for estimating
units and simulating from MCMCglmm.
Sorry for the wordy email.
cheers, Dave
> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MASS_7.3-9 MCMCglmm_2.10 corpcor_1.5.7 ape_2.6-2
coda_0.14-2
[6] tensorA_0.36 lme4_0.999375-37 Matrix_0.999375-47
lattice_0.19-17
loaded via a namespace (and not attached):
[1] gee_4.13-16 grid_2.12.1 nlme_3.1-97 stats4_2.12.1
glmer fit:
> print(dks.glmer.3, cor = F)
Generalized linear mixed model fit by the Laplace approximation
Formula: drinks ~ (weekend + gender + fratsor)^2 + (weekend - 1 | id) +
(1 | over)
Data: tlfb.df
AIC BIC logLik deviance
35961 36050 -17970 35939
Random effects:
Groups Name Variance Std.Dev. Corr
over (Intercept) 12.79661 3.57723
id weekendWeekday 0.14557 0.38154
weekendWeekend 0.80957 0.89976 -0.651
Number of obs: 23992, groups: over, 23992; id, 980
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.3684656 0.0805109 -54.26 < 2e-16 ***
weekendWeekend 1.3314914 0.1254440 10.61 < 2e-16 ***
genderMen 0.3592023 0.1153260 3.11 0.00184 **
fratsorFratSor 0.3345262 0.1704030 1.96 0.04963 *
weekendWeekend:genderMen 0.4945154 0.1740287 2.84 0.00449 **
weekendWeekend:fratsorFratSor -0.0008532 0.2062541 0.00 0.99670
genderMen:fratsorFratSor 0.5618560 0.1884112 2.98 0.00286 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
MCMCglmm fit:
> summary(dks.glmm3.1)
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 38559.60
G-structure: ~us(weekend):id
post.mean l-95% CI u-95% CI eff.samp
Weekday:Weekday.id 1.483 1.2583 1.721 334.7
Weekend:Weekday.id 1.130 0.9338 1.326 498.9
Weekday:Weekend.id 1.130 0.9338 1.326 498.9
Weekend:Weekend.id 1.677 1.3889 1.944 486.3
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 5.21 4.966 5.463 140.8
Location effects: drinks ~ (weekend + gender + fratsor)^2
post.mean l-95% CI u-95% CI eff.samp
pMCMC
(Intercept) -3.28303 -3.45352 -3.13054 353.7
<0.001 ***
weekendWeekend 1.55400 1.37543 1.70599 667.5
<0.001 ***
genderMen 0.43253 0.22656 0.65524 848.2
<0.001 ***
fratsorFratSor 0.57028 0.24066 0.93947 1000.0
<0.001 ***
weekendWeekend:genderMen 0.20535 -0.02118 0.43877 707.2
0.090 .
weekendWeekend:fratsorFratSor -0.27792 -0.55778 -0.01192 846.4
0.054 .
genderMen:fratsorFratSor 0.51294 0.09726 0.96072 1000.0
0.016 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R code for setting up and fitting models:
tlfb.df <-
read.csv("http://depts.washington.edu/cshrb/newweb/stats%20documents/TLFB.Final.csv",
header = TRUE)
str(tlfb.df)
### set categorical variables as factors
tlfb.df <- within(tlfb.df, {
gender <- factor(gender, 0:1, c("Women","Men"))
weekend <- factor(weekend, 0:1, c("Weekday","Weekend"))
fratsor <- factor(fratsor, 0:1, c("Not FratSor","FratSor"))
})
### Number of participants
length(unique(tlfb.df$id)) # 980
### how much data per person?
table(table(tlfb.df$id)) # 719 have all 30 days
### distribution of outcome
plot(table(tlfb.df$drinks), lwd=3, ylab = "Frequency",
xlab = "Number of Drinks",
main= "TLFB Data")
### examine means by covariates
with(tlfb.df, tapply(drinks, list(weekend, fratsor, gender),
mean, na.rm = TRUE))
### GLMM – over-dispersed Poisson
library(lme4)
### generate variable to capture over-dispersion
tlfb.df$over <- 1:nrow(tlfb.df)
### check contrasts
contrasts(tlfb.df$weekend)
contrasts(tlfb.df$gender)
contrasts(tlfb.df$fratsor)
### all 2-way interactions in fixed-effects
### random intercept and weekend (parameterized to be similar to
### MCMCglmm), plus per-observation random-effect
dks.glmer.3 <- glmer(drinks ~ (weekend + gender + fratsor)^2 +
(weekend -1 | id) + (1 | over),
data = tlfb.df, verbose = TRUE,
family = poisson)
# Comment from Jarrod:
# Using (weekend-1 | id) so that the covariance matrix
# refers to not-weekned and weekned rather than not-weeknd
# and the contrast between weekend and not-weekend.
# Alternativley you could put us(weekend+1 |):id in
# MCMCglmm which is equivalent to (weekend | id) in lmer.
print(dks.glmer.3, cor = F)
library(MCMCglmm)
prior2 <- list(R=list(V=1, nu=0.002),
G=list(G1=list(V=diag(2), nu=1.002)))
### NOTE: results largely the same over default, 65K and 500K iterations
dks.glmm3.1 <- MCMCglmm(drinks ~ (weekend + gender + fratsor)^2,
data = tlfb.df, family = "poisson",
random = ~ us(weekend):id,
prior = prior2,
verbose = TRUE, pr = TRUE, pl = TRUE)
summary(dks.glmm3.1)
### traceplots of variance terms – at default iterations not bad
### though as noted above, substantive results don’t change at much
### higher iterations
plot(dks.glmm3.1$VCV)
### how many zeroes (and counts generally) are each predicting?
glmer.fit <- fitted(dks.glmer.3)
mcmc.fit <- exp(colMeans(dks.glmm3.1$Liab))
### get predicted counts up to 20 drinks
pred.tlfb <- matrix(NA, ncol=2, nrow=20)
for (i in 0:19) {
pred.tlfb[i+1, 1] <- sum(dpois(i, glmer.fit))
pred.tlfb[i+1, 2] <- sum(dpois(i, mcmc.fit))
}
cbind(Obs = table(tlfb.df$drinks)[1:20],
glmer = pred.tlfb[,1],
mcmc = pred.tlfb[,2])
### NOTE: somewhat surprising how well these models predict
### the marginal distribution of counts, but…
#
### Examine distribution of random-effects for both models
ranef.glmer <- ranef(dks.glmer.3)
str(ranef.glmer)
par(mfrow=c(3,2))
hist(ranef.glmer[[1]][[1]], col = grey(.6))
qqnorm(ranef.glmer[[1]][[1]]) ; qqline(ranef.glmer[[1]][[1]])
hist(ranef.glmer[[2]][[1]], col = grey(.6))
qqnorm(ranef.glmer[[2]][[1]]) ; qqline(ranef.glmer[[2]][[1]])
hist(ranef.glmer[[2]][[2]], col = grey(.6))
qqnorm(ranef.glmer[[2]][[2]]) ; qqline(ranef.glmer[[2]][[2]])
### NOTE: over-dispersion random-effect has complete
### disjoint area and bimodal
### Random-effects in MCMCglmm
#
### both fixed and random are in Sol, fixed first
head(dks.glmm3.1$Sol[,1:10])
dim(dks.glmm3.1$Sol)
### NOTE: first 7 columns are fixed-effects, then next
### 980 are weekday estimates, then following 980 are
### weekend estimates
ranef.mcmc1 <- colMeans(dks.glmm3.1$Sol[,8:987])
ranef.mcmc2 <- colMeans(dks.glmm3.1$Sol[,988:1967])
par(mfrow=c(2,2))
hist(ranef.mcmc1)
qqnorm(ranef.mcmc1) ; qqline(ranef.mcmc1)
hist(ranef.mcmc2)
qqnorm(ranef.mcmc2) ; qqline(ranef.mcmc2)
### Look pretty reasonable
#
### Per Jarrod, observation level random-effects not
### stored, but "the latent variable minus the
### predictions on the link scale (i.e. type="terms")
### should be equivalent to the observation-level
### random effects in glmer."
ranef.mcmc3 <- colMeans(dks.glmm3.1$Liab) - predict(dks.glmm3.1,
marginal = NULL, type = "terms")
hist(ranef.mcmc3)
qqnorm(ranef.mcmc3) ; qqline(ranef.mcmc3)
### NOTE: Much like glmer, the distribution of “units” is totally
### non-normal
### From Jarrod, comparing that models are similar between
### glmer and MCMCglmm
nind <- length(unique(tlfb.df$id))
# number of individuals
nobs <- dim(tlfb.df)[1]
# number of observations
any((dks.glmer.3 at X-dks.glmm3.1$X)@x!=0)
# fixed effect design matrix is the same
any((t(dks.glmer.3 at Zt)[,1:nobs]-Diagonal(nobs))@x!=0)
# first nobs columns are the observation-level random effects
any((t(dks.glmer.3 at Zt)[,nobs+1:(nind*2)]-dks.glmm3.1$Z)@x!=0)
# random effect design matrix is the same
# so the models are the same.
# Generate some new data according to the posterior mean parameter
estimates from MCMCglmm:
library(MASS)
nind <- dim(dks.glmm3.1$Z)[2]/2
# number of individuals
W <- cBind(dks.glmm3.1$X, dks.glmm3.1$Z)
# design matrix
beta <- colMeans(dks.glmm3.1$Sol[,1:7])
# fixed effects
V.id <- matrix(colMeans(dks.glmm3.1$VCV[,1:4]),2,2)
# covariance matrix of id random effects
V.obs <- mean(dks.glmm3.1$VCV[,5])
u <- c(t(mvrnorm(dim(dks.glmm3.1$Z)[2]/2, c(0,0), V.id)))
#simulate some new random effects
theta <- c(beta,u)
tlfb.df$sim_drinks <- rpois(dim(W)[1],
exp((W%*%theta)@x + rnorm(dim(W)[1], 0, sqrt(V.obs))))
### compare simulated to observed drinks
table(tlfb.df$drinks)
table(tlfb.df$sim_drinks)
### NOTE: huge tail in simulated data that extends far beyond observed
### max of 45
--
Dave Atkins, PhD
Research Associate Professor
Department of Psychiatry and Behavioral Science
University of Washington
datkins at u.washington.edu
Center for the Study of Health and Risk Behaviors (CSHRB)
1100 NE 45th Street, Suite 300
Seattle, WA 98105
206-616-3879
http://depts.washington.edu/cshrb/
(Mon-Wed)
Center for Healthcare Improvement, for Addictions, Mental Illness,
Medically Vulnerable Populations (CHAMMP)
325 9th Avenue, 2HH-15
Box 359911
Seattle, WA 98104
http://www.chammp.org
(Thurs)
More information about the R-sig-mixed-models
mailing list