[R-sig-ME] rare binary outcome, MCMCglmm, and priors (related to separation)

Mon Aug 30 22:37:29 CEST 2010

Some colleagues have collected data from 184 females in dating 
relationships.  Data were collected daily using PDAs; the outcome is a 
binary indicator of whether any physical aggression occurred (intimate 
partner violence, or IPV).

They are interested in 3 covariates:

-- alcohol use: yes/no
-- anger: rated on 1-5 scale
-- verbal aggression: sum of handful of items, with 0-15 scale

Their hypothesis is that the interaction of all 3 covariates will lead 
to the highest likelihood of IPV.  As you might expect, the outcome is 
very rare with 51 instances of IPV out of 8,269 days of data, and 158 
women (out of 184) reported no instances of IPV.

Question 1: Given that a GLMM will assume a normal distribution for the 
person-specific baserate in IPV, is this data even appropriate for GLMM 
or should they be looking elsewhere (perhaps GEE)?

That said, for some (unknown) proportion of individuals, there probably 
would be instances of IPV if the data collection period were longer. 
Thus, perhaps there is some basis for assuming a distribution across 
people, even if the observed data for some individuals are all zeroes.

To present some of the data (and I can check to see if it would be okay 
to make the data available), I dichotomized both anger and verbal 
aggression ("prov.cut" below):

   ang.cut prov.cut alc.cut ipv.yes ipv.no
1       0        0       0       0     3918
2       0        0       1       0        1
3       1        0       0       5     2381
4       1        0       1       1      292
5       1        1       0      36     1471
6       1        1       1       9      257

Thus, the instances of IPV are more likely when there is anger and 
verbal aggression; alcohol is a little less clear.  (And, if the 
association of anger and verbal aggression with IPV seems tautological, 
there has been debate about different forms of IPV, where some research 
has pointed to "cold" aggression.)

Not surprisingly, analyses using either glmer() or MCMCglmm() show signs 
of partial separation, with some whopping odds-ratios and 95% CI 
spanning a couple orders of magnitude.

I have read a bit about the problems of separation in logistic 
regression and know that Gelman et al suggest Bayesian priors as one 
"solution".  Moreover, I see in Jarrod Hadfield's course notes that his 
multinomial example has a "structural" zero that he addresses via priors 
on pp. 96-97, though I confess I don't quite follow exactly what he has 
done (and why).

If I just let MCMCglmm cook on a regression with all 2-way interactions 
for a long while:

prior = list(R = list(V = 1, fix = 1),
			B = list(mu = c(rep(0,7)), V = diag(7)),
			G = list(G1 = list(V = 1, nu = 0.002)))
lr.mcmc <- MCMCglmm(ipv ~ (alc.cut + angc + log(provc + 0.03))^2, data = 
ipv.df,
					family = "categorical", verbose = TRUE,
					prior = prior,
					nitt = 2000000, burnin = 1000000, thin = 1000,
					random =  ~ person)

The answers are less extreme than what I get with glmer, perhaps 
suggesting this is wandering toward the "correct" solution, though there 
are also plenty of indicators that we aren't there yet:

 > summary(lr.mcmc)

  Iterations = 1999001
  Thinning interval  = 1000001
  Sample size  = 1000

  DIC: 379.972

  G-structure:  ~person

        post.mean l-95% CI u-95% CI eff.samp
person     2.287   0.6775    4.206    194.0

  R-structure:  ~units

       post.mean l-95% CI u-95% CI eff.samp
units         1        1        1        0

  Location effects: ipv ~ (alc.cut + angc + log(provc + 0.03))^2

                           post.mean l-95% CI u-95% CI eff.samp  pMCMC
(Intercept)                -2.58724 -3.49492 -1.65250   245.20 <0.001 ***
alc.cut                     0.49512 -0.75268  1.99397  1000.00  0.464
angc                        0.02664 -0.34227  0.41365   283.90  0.880
log(provc + 0.03)           1.36626  0.98743  1.70863    28.69 <0.001 ***
alc.cut:angc               -0.16519 -0.79299  0.41949   683.56  0.590
alc.cut:log(provc + 0.03)  -0.02631 -0.53118  0.55065   157.34  0.898
angc:log(provc + 0.03)     -0.26132 -0.40141 -0.10407    74.98  0.004 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I haven't run multiple chains yet, but the effective sample sizes and 
trace plots already suggest we ain't there yet.  My specific question is 
whether there would be an alternative prior specification for the 
fixed-effects that would be more appropriate?

I would appreciate any and all thoughts here, including if this just 
doesn't seem like an appropriate data/question for GLMMs.

sessionInfo below.

cheers, Dave

 > sessionInfo()
R version 2.11.1 (2010-05-31)
i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    splines   stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
  [1] MCMCglmm_2.06      corpcor_1.5.7      ape_2.5-3
  [4] coda_0.13-5        Matrix_0.999375-44 lattice_0.19-10
  [7] tensorA_0.35       Hmisc_3.8-2        modeltools_0.2-16
[10] mvtnorm_0.9-92     survival_2.36-1

loaded via a namespace (and not attached):
  [1] cluster_1.13.1   coin_1.0-16      colorspace_1.0-1 gee_4.13-15
  [5] grid_2.11.1      lme4_0.999375-35 nlme_3.1-96      party_0.9-9998
  [9] rpart_3.1-46     tools_2.11.1

-- 
Dave Atkins, PhD
Research Associate Professor
Department of Psychiatry and Behavioral Science
University of Washington
datkins at u.washington.edu

Center for the Study of Health and Risk Behaviors (CSHRB)		
1100 NE 45th Street, Suite 300 	
Seattle, WA  98105 	
206-616-3879 	
http://depts.washington.edu/cshrb/
(Mon-Wed)	

Center for Healthcare Improvement, for Addictions, Mental Illness,
   Medically Vulnerable Populations (CHAMMP)
325 9th Avenue, 2HH-15
Box 359911
Seattle, WA 98104
http://www.chammp.org
(Thurs)