[R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation
Ben Bolker
bbolker at gmail.com
Thu Sep 22 04:58:46 CEST 2011
Jeremy Koster <helixed2 at ...> writes:
>
> Thanks to Ben, Thomas, and David for the responses.
> As usual, the explanations didn't really sink in
> until I experimented with different formulations of models.
>
> Somewhat surprisingly, I discovered that lme4 can
> handle the following syntax for a dataset in which
> observations have been aggregated into rows by individuals
> (i.e., n = 100 rows):
>
> smoking.aggregated <- glmer (cbind(smoking_observations,
> total_observations - smoking_observations) ~ AGE + (1|Individual),
> family = binomial, data = aggregated)
>
> which produces the same estimates of the intercept, AGE,
> and Individual-level variance as the following
> code, which refers to an unaggregated dataset of 5000 rows
> (100 individuals with 50 observations each),
> with a binary variable for smoking (or not):
>
> smoking.unaggregated <- glmer (smoking ~ AGE +
> (1|Individual), family = binomial, data = unaggregated)
>
> Where the models differ is the value for the log-likelihood
> (and AIC). Considering that the estimates for
> covariates and random effects were identical in both models,
> my first guess was that lme4 was basically
> treating the aggregated data like an unaggregated dataset.
>
> Why, then, does the log-likelihood differ (substantially)?
> What are the implications for model
> selection, using AIC, for example? Is it possible that
> one might choose different models depending on
> whether the data have been aggregated or not?
>
Consider the differences between these values:
> dbinom(10,prob=0.5,size=20,log=TRUE)
[1] -1.736152
> sum(dbinom(rep(0:1,each=10),prob=0.5,size=1,log=TRUE))
[1] -13.86294
The difference is lchoose(20,10); that is, the only
difference is in the normalization constant (because the
aggregated form counts all possible orderings of the 0s
and 1s within a group).
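A quick way to check this, as a minimal sketch in base R (the helper name gap() is made up for illustration): the gap between the aggregated and the Bernoulli log-likelihood should equal lchoose(20,10) whatever probability is plugged in, which is why it drops out of any comparison between parameter values.

```r
# Gap between the aggregated (binomial) and unaggregated (Bernoulli)
# log-likelihood for 10 successes in 20 trials, at success probability p.
# gap() is a hypothetical helper, not part of any package.
gap <- function(p) {
  ll_agg  <- dbinom(10, size = 20, prob = p, log = TRUE)
  ll_bern <- sum(dbinom(rep(0:1, each = 10), size = 1, prob = p, log = TRUE))
  ll_agg - ll_bern
}

# The gap does not depend on p ...
gaps <- sapply(c(0.2, 0.5, 0.8), gap)
gaps

# ... and matches the log of the number of orderings
lchoose(20, 10)
all.equal(gaps, rep(lchoose(20, 10), 3))
```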
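As long as you only consider *differences* in AIC or
in log-likelihood between models fitted to the same form
of the data (all aggregated or all unaggregated), the constant
cancels and the comparisons should all come out the same,
so any inferences about the effects of parameters etc.
should be the same too. Try some experiments to convince
yourself.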
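One such experiment might look like the sketch below. It rests on several assumptions that are not from the thread: the data are simulated, the column names yes and no are invented, and plain glm() is used instead of glmer() so that it runs in base R without lme4. The same reasoning should carry over to the mixed model, but that is not checked here. Note that the binomial family takes a two-column response as cbind(successes, failures).

```r
set.seed(101)  # arbitrary seed, for reproducibility only

# Simulated (hypothetical) data: 100 individuals, 50 binary observations each
n_ind <- 100
n_obs <- 50
age <- runif(n_ind, 20, 60)
p <- plogis(-2 + 0.04 * age)
unagg <- data.frame(
  Individual = rep(seq_len(n_ind), each = n_obs),
  AGE = rep(age, each = n_obs)
)
unagg$smoking <- rbinom(nrow(unagg), size = 1, prob = rep(p, each = n_obs))

# Aggregate to one row per individual: counts of successes and failures
agg <- data.frame(
  Individual = seq_len(n_ind),
  AGE = age,
  yes = as.vector(tapply(unagg$smoking, unagg$Individual, sum))
)
agg$no <- n_obs - agg$yes

# Fit a null model and an AGE model to each form of the data
u0 <- glm(smoking ~ 1, family = binomial, data = unagg)
u1 <- glm(smoking ~ AGE, family = binomial, data = unagg)
a0 <- glm(cbind(yes, no) ~ 1, family = binomial, data = agg)
a1 <- glm(cbind(yes, no) ~ AGE, family = binomial, data = agg)

# Coefficients agree between the two forms (up to fitting tolerance)
all.equal(coef(u1), coef(a1), tolerance = 1e-6)

# Log-likelihoods differ by the summed log binomial coefficients ...
const <- sum(lchoose(n_obs, agg$yes))
ll_gap <- as.numeric(logLik(a1)) - as.numeric(logLik(u1))
all.equal(ll_gap, const, tolerance = 1e-6)

# ... so the AIC difference between models is the same in both forms
daic_unagg <- AIC(u0) - AIC(u1)
daic_agg <- AIC(a0) - AIC(a1)
c(unaggregated = daic_unagg, aggregated = daic_agg)
```

The absolute AIC values differ between the two forms by 2 * const, but that offset is shared by every model fitted to the same form of the data, so it has no effect on which model is preferred.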