[R-sig-ME] mixed effects model proportion data that are not 0 and 1's

Ben Bolker bbolker at gmail.com
Mon Oct 12 20:16:35 CEST 2015


On 15-10-12 01:42 PM, M West wrote:
> Hello,
> 
> I am trying to decide the best approach for analyzing a short time series.
> 
> The goal is to see if there is a significant relationship between the
> percentage change in females (# females/total population during epidemic -
> # females/total population pre-epidemic) and disease prevalence (continuous
> variable: percentage infected).
> 
> Samples are across 15 sites and 5 months. So a very short time series.
> 
> Here's the model and weights that I've used thus far.
> v3 <- varComb(varIdent(form =~ 1 | Month) , varExp(form =~
> Proportion_infected))
> mod <- lme(Change_in_proportion_females ~ Proportion_infected + Month,
> random = ~ 1 |Site/Month, weights = v3)
> 
> The main problems are:
> 1) If I plot the residuals vs. the fitted values, there is a strong
> relationship demonstrating that the variance increase with the mean. How do
> I account for this in a mixed effects model? I've tried the weights option
> (with a couple of variations), however, I receive an error message:
> "iteration limit reached without convergence"

  weights= is the right way to go about it; we'd have to see more
details to know what the problem is in your particular case.

 Your model looks like it might be overfitted: if you only have one
observation per Site/Month combination, then the Site-by-Month
interaction term (Site/Month expands to "Site plus Site-by-Month")
will be confounded with the residual variance.  I have found in the
past that lme may allow you to fit the overfitted model without
complaining; one way of diagnosing this problem is to try intervals() on
your fitted model (try it without the weights= argument for
simplicity) and see if you get warnings/errors.

> 
> 2) I also have a question about the way that R treats proportion data that
> are not 0 and 1's when using glm (or glmer) with the binomial distribution.
> Crawley, for example, suggests that you create a y variable where you
> account for the total number of observations (e.g., y <- cbind(total
> females, (total population - total females)) and then run a glm (or in may
> case, a generalized linear mixed model) using the binomial distribution.
> 
> But what is actually going on under the hood here? Is R running some sort
> of proportion test?  All the examples that I have found for proportion data
> or non-normal residuals suggest using a logistic regression approach with
> the binomial distribution. Is this also the best option for proportion data
> that are not 0 and 1's (i.e., my data on the percentage change would not
> match the typical logistic regression plot)?

  Binomial and Bernoulli (binary) logistic regression are basically
equivalent; when you have multiple individuals observed for a particular
set of predictor variables (e.g. month/site), it's practical
and computationally efficient to collapse them to a binomial regression.
Under the hood there is indeed the equivalent of a proportion test --
not necessarily quite as accurate for small numbers of counts as some
classical proportion tests, but much more flexible.  In lme4 (the
successor to nlme, which implements generalized linear MMs that can
handle a binomial response), the 'weights' argument is simpler/can be
used to specify the denominator or total number of counts:

library(lme4)
 mod <- glmer(Change_in_proportion_females ~ Proportion_infected + Month
+ (1 |Site/Month),
family=binomial, weights = total_counts, data=...)

 (here, including the site-by-month interaction is OK -- it represents
the possibility of extra-binomial variance (overdispersion).)



More information about the R-sig-mixed-models mailing list