[R-sig-ME] rare binary outcome, MCMCglmm, and priors (related to separation)
David Atkins
datkins at u.washington.edu
Mon Aug 30 22:37:29 CEST 2010
Some colleagues have collected data from 184 females in dating
relationships. Data were collected daily using PDAs; the outcome is a
binary indicator of whether any physical aggression occurred (intimate
partner violence, or IPV).
They are interested in 3 covariates:
-- alcohol use: yes/no
-- anger: rated on 1-5 scale
-- verbal aggression: sum of handful of items, with 0-15 scale
Their hypothesis is that the interaction of all 3 covariates will lead
to the highest likelihood of IPV. As you might expect, the outcome is
very rare with 51 instances of IPV out of 8,269 days of data, and 158
women (out of 184) reported no instances of IPV.
Question 1: Given that a GLMM will assume a normal distribution for the
person-specific baserate in IPV, is this data even appropriate for GLMM
or should they be looking elsewhere (perhaps GEE)?
That said, for some (unknown) proportion of individuals, there probably
would be instances of IPV if the data collection period were longer.
Thus, perhaps there is some basis for assuming a distribution across
people, even if the observed data for some individuals are all zeroes.
To present some of the data (and I can check to see if it would be okay
to make the data available), I dichotomized both anger and verbal
aggression ("prov.cut" below):
ang.cut prov.cut alc.cut ipv.yes ipv.no
1 0 0 0 0 3918
2 0 0 1 0 1
3 1 0 0 5 2381
4 1 0 1 1 292
5 1 1 0 36 1471
6 1 1 1 9 257
Thus, the instances of IPV are more likely when there is anger and
verbal aggression; alcohol is a little less clear. (And, if the
association of anger and verbal aggression with IPV seems tautological,
there has been debate about different forms of IPV, where some research
has pointed to "cold" aggression.)
Not surprisingly, analyses using either glmer() or MCMCglmm() show signs
of partial separation, with some whopping odds-ratios and 95% CI
spanning a couple orders of magnitude.
I have read a bit about the problems of separation in logistic
regression and know that Gelman et al suggest Bayesian priors as one
"solution". Moreover, I see in Jarrod Hadfield's course notes that his
multinomial example has a "structural" zero that he addresses via priors
on pp. 96-97, though I confess I don't quite follow exactly what he has
done (and why).
If I just let MCMCglmm cook on a regression with all 2-way interactions
for a long while:
prior = list(R = list(V = 1, fix = 1),
B = list(mu = c(rep(0,7)), V = diag(7)),
G = list(G1 = list(V = 1, nu = 0.002)))
lr.mcmc <- MCMCglmm(ipv ~ (alc.cut + angc + log(provc + 0.03))^2, data =
ipv.df,
family = "categorical", verbose = TRUE,
prior = prior,
nitt = 2000000, burnin = 1000000, thin = 1000,
random = ~ person)
The answers are less extreme than what I get with glmer, perhaps
suggesting this is wandering toward the "correct" solution, though there
are also plenty of indicators that we aren't there yet:
> summary(lr.mcmc)
Iterations = 1999001
Thinning interval = 1000001
Sample size = 1000
DIC: 379.972
G-structure: ~person
post.mean l-95% CI u-95% CI eff.samp
person 2.287 0.6775 4.206 194.0
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 1 1 1 0
Location effects: ipv ~ (alc.cut + angc + log(provc + 0.03))^2
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) -2.58724 -3.49492 -1.65250 245.20 <0.001 ***
alc.cut 0.49512 -0.75268 1.99397 1000.00 0.464
angc 0.02664 -0.34227 0.41365 283.90 0.880
log(provc + 0.03) 1.36626 0.98743 1.70863 28.69 <0.001 ***
alc.cut:angc -0.16519 -0.79299 0.41949 683.56 0.590
alc.cut:log(provc + 0.03) -0.02631 -0.53118 0.55065 157.34 0.898
angc:log(provc + 0.03) -0.26132 -0.40141 -0.10407 74.98 0.004 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I haven't run multiple chains yet, but the effective sample sizes and
trace plots already suggest we ain't there yet. My specific question is
whether there would be an alternative prior specification for the
fixed-effects that would be more appropriate?
I would appreciate any and all thoughts here, including if this just
doesn't seem like an appropriate data/question for GLMMs.
sessionInfo below.
cheers, Dave
> sessionInfo()
R version 2.11.1 (2010-05-31)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 splines stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] MCMCglmm_2.06 corpcor_1.5.7 ape_2.5-3
[4] coda_0.13-5 Matrix_0.999375-44 lattice_0.19-10
[7] tensorA_0.35 Hmisc_3.8-2 modeltools_0.2-16
[10] mvtnorm_0.9-92 survival_2.36-1
loaded via a namespace (and not attached):
[1] cluster_1.13.1 coin_1.0-16 colorspace_1.0-1 gee_4.13-15
[5] grid_2.11.1 lme4_0.999375-35 nlme_3.1-96 party_0.9-9998
[9] rpart_3.1-46 tools_2.11.1
--
Dave Atkins, PhD
Research Associate Professor
Department of Psychiatry and Behavioral Science
University of Washington
datkins at u.washington.edu
Center for the Study of Health and Risk Behaviors (CSHRB)
1100 NE 45th Street, Suite 300
Seattle, WA 98105
206-616-3879
http://depts.washington.edu/cshrb/
(Mon-Wed)
Center for Healthcare Improvement, for Addictions, Mental Illness,
Medically Vulnerable Populations (CHAMMP)
325 9th Avenue, 2HH-15
Box 359911
Seattle, WA 98104
http://www.chammp.org
(Thurs)
More information about the R-sig-mixed-models
mailing list