[R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation

Thu Sep 22 04:58:46 CEST 2011

Jeremy Koster <helixed2 at ...> writes:

> 
> Thanks to Ben, Thomas, and David for the responses.  
> As usual, the explanations didn't really sink in
> until I experimented with different formulations of models.
> 
> Somewhat surprisingly, I discovered that lme4 can 
> handle the following syntax for a dataset in which
> observations have been aggregated into rows by individuals 
> (i.e., n = 100 rows):
> 
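> # smoking_obs and total_obs are placeholder column names;
> # cbind() takes (successes, failures), not (successes, total)
> smoking.aggregated <- glmer(cbind(smoking_obs, 
>   total_obs - smoking_obs) ~ AGE + (1|Individual),
>   family = binomial, data = aggregated)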
> 
> which produces the same estimates of the intercept, AGE,
>  and Individual-level variance as the following
> code, which refers to an unaggregated dataset of 5000 rows 
> (100 individuals with 50 observations each),
> with a binary variable for smoking (or not):
> 
> smoking.unaggregated <- glmer (smoking ~ AGE + 
> (1|Individual), family = binomial, data = unaggregated)
> 
> Where the models differ is the value for the log-likelihood
>  (and AIC).  Considering that the estimates for
> covariates and random effects were identical in both models, 
> my first guess was that lme4 was basically
> treating the aggregated data like an unaggregated dataset.
> 
> Why, then, does the log-likelihood differ (substantially)?  
> What are the implications for model
> selection, using AIC, for example?  Is it possible that 
> one might choose different models depending on
> whether the data have been aggregated or not?
> 

  Consider the differences between these values:

  > dbinom(10,prob=0.5,size=20,log=TRUE)
  [1] -1.736152
  > sum(dbinom(rep(0:1,each=10),prob=0.5,size=1,log=TRUE))
  [1] -13.86294

  The difference is lchoose(20,10) -- that is, the only
difference is in the normalization constant (because the
aggregated form counts all possible orderings of the 0s
and 1s within a group).
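
  A quick check of that claim, as a minimal sketch (the
variable names ll_agg and ll_long are mine, chosen only for
illustration):

```r
# Log-likelihood of 10 successes out of 20, aggregated form
ll_agg  <- dbinom(10, size = 20, prob = 0.5, log = TRUE)

# Same data as 20 separate Bernoulli observations (ten 0s, ten 1s)
ll_long <- sum(dbinom(rep(0:1, each = 10), size = 1, prob = 0.5,
                      log = TRUE))

# The gap between the two forms ...
ll_agg - ll_long

# ... should match the log of the binomial coefficient
lchoose(20, 10)
all.equal(ll_agg - ll_long, lchoose(20, 10))
```

  The constant depends only on the counts, not on prob, so
changing prob shifts both log-likelihoods by the same amount.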
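  The constant does not depend on the model parameters, so
as long as you only consider *differences* in AIC or in
log-likelihood between models fitted to the same form of
the data (all aggregated or all unaggregated), it cancels.
Model comparisons, and any inferences about the effects of
parameters etc., should therefore all be the same.  Try
some experiments to convince yourself.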
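
  One such experiment might look like the sketch below.  It
rests on several assumptions of mine: the data are simulated
(the coefficients -2 and 0.03 and the age range are made up),
all object names are hypothetical, and it uses plain glm()
without the random effect so that it runs in base R.  The
normalization constant enters the binomial log-likelihood in
the same way, so the same comparison could be repeated with
glmer().

```r
set.seed(1)
n_id  <- 100                         # individuals
n_obs <- 50                          # observations per individual
age   <- round(runif(n_id, 18, 70))  # made-up covariate
p     <- plogis(-2 + 0.03 * age)     # made-up true probabilities

# Unaggregated: one row per binary observation (5000 rows)
long <- data.frame(id  = rep(seq_len(n_id), each = n_obs),
                   age = rep(age, each = n_obs))
long$smoking <- rbinom(nrow(long), size = 1, prob = p[long$id])

# Aggregated: one row per individual (100 rows)
agg <- data.frame(id  = seq_len(n_id),
                  age = age,
                  yes = as.vector(tapply(long$smoking, long$id, sum)))
agg$no <- n_obs - agg$yes

# Fit a model with and without age to each form of the data
fit_long1 <- glm(smoking ~ age, family = binomial, data = long)
fit_long0 <- glm(smoking ~ 1,   family = binomial, data = long)
fit_agg1  <- glm(cbind(yes, no) ~ age, family = binomial, data = agg)
fit_agg0  <- glm(cbind(yes, no) ~ 1,   family = binomial, data = agg)

# Log-likelihoods differ between forms by the summed constant
const  <- sum(lchoose(n_obs, agg$yes))
ll_gap <- as.numeric(logLik(fit_agg1)) - as.numeric(logLik(fit_long1))
c(ll_gap = ll_gap, const = const)

# AIC differences between models agree across the two forms
daic_long <- AIC(fit_long1) - AIC(fit_long0)
daic_agg  <- AIC(fit_agg1)  - AIC(fit_agg0)
c(daic_long = daic_long, daic_agg = daic_agg)
```

  The raw AIC values differ between the two forms, but the
ranking of models within either form should not.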