[R] Accounting for overdispersion in a mixed-effect model with a proportion response variable and categorical explanatory variables.

Fri Apr 22 02:49:11 CEST 2011

Richard Friedman <friedman <at> cancercenter.columbia.edu> writes:

> 
> Dear R-help-list,
> 
> 	I have a problem in which the explanatory variables are categorical,  
> the response variable is  a proportion, and experiment contains  
> technical replicates (pseudoreplicates) as well as biological  
> replicated. I am new to both generalized linear models and mixed- 
> effects models and would greatly appreciate the advice of experienced  
> analysts in this matter.
> 
> I analyzed the data in 4 ways and want to know which is the best way.  
> The 4 ways are:
> 
> 1.	A generalized linear model with binomial error in which the  
> positive and negative counts for each biological replicate is summed  
> over technical replicates.
> 2.	Same as 1 with a quasibinomial error model.
> 3.	A generalized linear mixed-effects model with binomial error in  
> which technical replication is treated as a random effect.
> 4.	A generalized linear mixed-effects model with binomial error in  
> which technical replication is treated as a random effect and  
> overdispersion is taken into account by individual level variability.

  Off the top of my head, I would say that #4 is best.
> 
> Here are the relevant data for each model:
> 
> For everything:
> 
>  > sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] lme4_0.999375-39   Matrix_0.999375-50 lattice_0.19-23
> 
> loaded via a namespace (and not attached):
> [1] grid_2.13.0   nlme_3.1-100  stats4_2.13.0 tools_2.13.0

X <- read.table("mouse.dat",header=TRUE,
colClasses=rep(c("factor","numeric"),c(3,2)))
Xagg <- aggregate(cbind(positive,negative)~treatment+mouse,data=X,FUN=sum)
## gets your aggregated data
> 
> 1.	A generalized linear model with binomial error in which the  
> positive and negative counts for each biological replicate is summed  
> over technical replicates.

model <- glm(cbind(positive,negative)~treatment,family=binomial,data=Xagg)

  If all the assumptions of the model were actually met this might be OK (since
if all observations on all mice were really independent, the sum of binomials
would be binomial) but the residual deviance suggests overdispersion (dev/df
approx. 2)

>     Null deviance: 60.467  on 17  degrees of freedom
> Residual deviance: 28.711  on 14  degrees of freedom
>   (2 observations deleted due to missingness)
> AIC: 160.35

(note that your 2 NA observations got dropped during my aggregation
step)

> 
> Since Residual deviance >> degrees of freedom I tried

  yes (I'm working through this before I see your comments, but
you seem to be on the right track)

> 
> 2.	Same as 1 with a quasibinomial error model.
> 
>  > model<-glm(y ~ treatment, quasibinomial)
>  > summary(model)
> 
> Call:
> glm(formula = y ~ treatment, family = quasibinomial)
> 
> Deviance Residuals:
>     Min       1Q   Median       3Q      Max
> -2.3134  -0.5712  -0.3288   0.8616   2.4352
> 
> Coefficients:
>                      Estimate Std. Error t value Pr(>|t|)
> (Intercept)         -0.574195   0.052173 -11.006 2.82e-08 ***
> treatmentB  0.164364   0.073180   2.246  0.04136 *
> treatmentC  0.007025   0.078306   0.090  0.92978
> treatmentD  0.258135   0.077038   3.351  0.00476 **
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> (Dispersion parameter for quasibinomial family taken to be 2.049598)
> 
>     Null deviance: 60.467  on 17  degrees of freedom
> Residual deviance: 28.711  on 14  degrees of freedom
>   (2 observations deleted due to missingness)
> AIC: NA
> 
> Number of Fisher Scoring iterations: 3
> 
  Seems reasonable.

> Analysis of Deviance Table
> 
> Model: quasibinomial, link: logit
> 
> Response: y
> 
> Terms added sequentially (first to last)
> 
>             Df Deviance Resid. Df Resid. Dev      F  Pr(>F)
> NULL                           17     60.467
> treatment  3   31.756        14     28.711 5.1646 0.01303 *
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>  >
> 
> I then tried to take the effect of pseudoreplication into account with  
> 3:
> 
> 3.	A generalized linear mixed-effects model with binomial error in  
> which technical replication is treated as a random effect.
> 
> Here is the input file:
> 
[snip]

> Since the non-mixed effects model showed evidence of overdispersion,  
> and since there is no quasibinomial option in lme4, I tried to account  
> for overdispesion by including individual level variability with..
> 
> 4.	A generalized linear mixed-effects model with binomial error in  
> which technical replication is treated as a random effect and  
> overdispersion is taken into account by individual level variability.
> 
X$obs<-1:nrow(X)

library(lme4)
## it doesn't make sense to include treatment in the random effect
##  since it is already present as a fixed effect in the model
## "mouse within treatment" (= mouse:treatment) is what you want
model2<-lmer(cbind(positive,negative)~treatment+(1|mouse:treatment)+
             (1|obs) ,data=X, family=binomial)
## note that mouse:treatment variance comes out to zero

model2B<-lmer(cbind(positive,negative)~treatment+(1|mouse:treatment),
             data=X, family=binomial)

anova(model2,model2B)
## and yet model with observation-level RE appears significantly better
## (even though this is a conservative test because the null hypothesis
## is on the boundary)

I then compared the 2 models.
> 
>  > anova(model,model2,test="F")

##   note that test="F" is ignored
> 
> 1.	Am I correct that, of the 4 models,  I should use the mixed effect  
> model with individual variability?
> Although both models make the same effects to be significant, I would  
> like to know which one I should report and use as input to a  
> subsequent multiple comparisons analysis with multiicomp.

  I think so.  It may however comfort you to see that the coefficients
and their standard errors are practically the same under each model.

  For the gold standard you may want to do parametric bootstrapping
(see ?refit in the most recent (r-forge) version of lme4)

  I would suggest that future questions along these lines go to the
r-sig-mixed-models mailing list ...

  Ben Bolker