[R-sig-ME] What to do when a subset of binomial data has only positive outcomes

Jarrod Hadfield j.hadfield at ed.ac.uk
Sat Jul 31 15:22:19 CEST 2010


You may want to take a look at "A weakly informative default prior
distribution for logistic and other regression models", Gelman et al.
2008, The Annals of Applied Statistics 2, pp. 1360-1383, which you would
need to implement in BUGS or JAGS. However, an easier route may be to
use normal priors on the coefficients and use MCMCglmm. In many cases I
think this prior specification has good properties. For example:

# data with complete separation
dat<-data.frame(y = gl(2,50, labels=c(0,1)), tr = gl(2,50))

# standard glm #

summary(glm(y~tr-1, family="binomial", data=dat))
# SEs and p-values suffer from the Hauck-Donner effect
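
An aside, not part of the original advice: for a fixed-effects-only model
such as this toy example, the default prior of Gelman et al. 2008 is
implemented as bayesglm() in the arm package, which keeps the estimates
finite under complete separation. A minimal sketch:

# install.packages("arm")  # if arm is not already installed
library(arm)
summary(bayesglm(y~tr-1, family="binomial", data=dat))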

# Bayesian glm - logit link #

prior1<-list(R=list(V=1, fix=1), B=list(V=diag(2)*(pi^2/3+1), mu=rep(0,2)))

# The residual variance is not identifiable in the likelihood for binary
# data, so it is fixed at 1.
# The coefficients are given independent prior distributions with mean zero
# and variance pi^2/3 (the variance of the logistic distribution) + 1 (the
# residual variance).

m1<-MCMCglmm(y~tr-1, family="categorical", data=dat, prior=prior1,slice=T)

# use family categorical: link is logit (plogis)

c2 <- (16 * sqrt(3)/(15 * pi))^2

HPDinterval(plogis(m1$Sol[,1]/sqrt(1+c2)))
HPDinterval(plogis(m1$Sol[,2]/sqrt(1+c2)))

# get credible intervals for the probability of success for both levels of
# tr after marginalising the residuals
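
A quick numerical check (not in the original post) of the approximation
behind dividing by sqrt(1 + c2): with a residual e ~ N(0, 1) on the latent
scale, E[plogis(eta + e)] is approximately plogis(eta / sqrt(1 + c2)).

eta <- 1.5  # an arbitrary value on the latent (logit) scale
exact <- integrate(function(e) plogis(eta + e) * dnorm(e), -Inf, Inf)$value
c(exact = exact, approx = plogis(eta/sqrt(1+c2)))  # should agree closely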

# Bayesian glm - probit link #

prior2<-list(R=list(V=1, fix=1), B=list(V=diag(2)*2, mu=rep(0,2)))

# Again the residual variance is not identifiable in the likelihood for
# binary data, so it is fixed at 1.
# The coefficients are given independent prior distributions with mean zero
# and variance 1 (the variance of the unit normal) + 1 (the residual
# variance).

m2<-MCMCglmm(y~tr-1, family="ordinal", data=dat, prior=prior2, slice=T)

# use family ordinal: link is probit (pnorm)

c2 <- 1

HPDinterval(pnorm(m2$Sol[,1]/sqrt(1+c2)))
HPDinterval(pnorm(m2$Sol[,2]/sqrt(1+c2)))

# get credible intervals for the probability of success for both levels of
# tr after marginalising the residuals.
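
A similar check (not in the original post): for the probit link the
rescaling is exact, because E[pnorm(eta + e)] with e ~ N(0, 1) equals
pnorm(eta / sqrt(1 + 1)).

eta <- 1.5  # an arbitrary value on the latent (probit) scale
exact <- integrate(function(e) pnorm(eta + e) * dnorm(e), -Inf, Inf)$value
c(exact = exact, closed_form = pnorm(eta/sqrt(1+c2)))  # agree up to integration error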

You will probably need more iterations than the default when complete
separation exists.
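
For example (the settings below are illustrative, not a recommendation
from the original post), the run length is controlled through the nitt,
burnin and thin arguments:

m1_long <- MCMCglmm(y~tr-1, family="categorical", data=dat, prior=prior1,
                    slice=T, nitt=130000, burnin=30000, thin=100)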

Cheers,

Jarrod


Quoting Ben Bolker <bbolker at gmail.com>:

>   The bleeding-edge version of lme4, lme4a, has some profiling
> capabilities that might (?) be useful. A Bayesian approach (which you
> would probably have to roll yourself; glmmBUGS exists but did not seem
> flexible enough to deal with crossed random effects the last time I
> looked) would also be useful for stabilizing this kind of estimation
> problem (i.e., assigning a prior would allow the posterior probability
> density to be not quite completely concentrated at prob=0 or prob=1
> ...)  For inspiration you might also try looking in the GLM literature
> under the keyword 'complete separation'.
>
> On Fri, Jul 30, 2010 at 10:28 AM,  <Timothy_Handley at nps.gov> wrote:
>>
>> I had this problem with glm. My solution was to create a likelihood
>> function, and use the optim or optimize function to find an appropriate
>> confidence interval. More specifically, I found the likelihood of the data
>> for the best-estimate parameters, then found the range of parameters for
>> which a likelihood-ratio-test (best-estimate params vs. current params) had
>> a p-value>.025. The range is then the 95% confidence interval for those
>> parameters. This is essentially replicating the functionality of glm by
>> hand, as it seemed unable to deal with data which is all positive or all
>> negative.
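
A minimal sketch (not from the original thread) of this profile-likelihood
interval for a single binomial proportion, using the usual chi-square
cutoff for a 95% interval; the helper name is made up for illustration:

binom_profile_ci <- function(y, n, level = 0.95) {
  loglik <- function(p) dbinom(y, n, p, log = TRUE)  # binomial log-likelihood
  cutoff <- loglik(y/n) - qchisq(level, df = 1)/2    # LRT acceptance region
  grid   <- seq(1e-6, 1 - 1e-6, by = 1e-5)
  range(grid[loglik(grid) >= cutoff])                # interval endpoints
}
binom_profile_ci(50, 50)  # all-positive outcomes still give a finite interval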
>>
>> While I understand how to do this with glm, I don't understand enough about
>> fixed effects models to do the same for lmer. If you were willing to treat
>> everything as a fixed effect and switch to glm, then you could e-mail me,
>> and I could offer some more specific advice on how to do this.
>>
>> If anyone else has a better solution, I'd be very interested to hear it.
>>
>> Tim Handley
>> Fire Effects Monitor
>> Santa Monica Mountains National Recreation Area
>> 401 W. Hillcrest Dr.
>> Thousand Oaks, CA 91360
>> 805-370-2300 x2412
>>
>> Date: Thu, 29 Jul 2010 17:57:35 -0700
>> From: Xiao He <praguewatermelon at gmail.com>
>> To: r-sig-mixed-models at r-project.org
>> Subject: [R-sig-ME] What to do when a subset of binomial data has only
>>             positive outcomes
>>
>> Dear R users and experts,
>>
>> I tried to fit a model as shown below:
>>
>> data.lmer<-lmer(word1~compoundType*nativelanguage + (1|subject) + (1|word),
>> data=data, family="binomial")
>>
>> The factor 'compoundType' has three levels: blue, green, red.
>> The factor nativelanguage has two levels: english, other.
>>
>> I obtained the following results:
>> ************************************************
>> Fixed effects:
>>                              Estimate Std. Error z value Pr(>|z|)
>> (Intercept)                    -1.0190     0.5777  -1.764   0.0778 .
>> nativelanguageother            -0.6862     0.4075  -1.684   0.0922 .
>> typegreen                      19.7882   728.0627   0.027   0.9783
>> typered                         3.6985     0.8359   4.425 9.65e-06 ***
>> nativelanguageother:typegreen -16.1915   728.0624  -0.022   0.9823
>> nativelanguageother:typered    -2.8106     0.5639  -4.984 6.23e-07 ***
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> ************************************************
>>
>> As you can see, the standard errors and estimates for "typegreen" and
>> "nativelanguageother:typegreen" are enormous. I examined my data and
>> realized that the responses obtained for the level "green" of
>> compoundType were all positive. Yet "green" clearly differs from the
>> base level "blue": green had 100% positive responses, whereas blue had
>> only 30% positive responses.
>>
>> I wonder if there is any way to conduct statistical analyses to show that
>> green is significantly different from blue (and potentially red) because
>> obviously I can't just say they are different :)...
>>
>> Thank you in advance for your help!
>>
>>
>>
>> Best,
>>
>> Xiao He
>> Graduate student
>> Department of Linguistics
>> University of Southern California






