[R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation

Tue Sep 20 00:44:10 CEST 2011

Imagine that I have observed 100 people on 50 separate occasions.  For each observation, I record whether they are smoking or not.  I am interested in modeling the effect of age on the likelihood of smoking.

I can envision two ways of doing this. The first is to leave the data in an unaggregated format, that is, a dataset with 5000 rows, and then specify a model with a random effect for individual, such as:

library(lme4)
smoking.logistic <- glmer(smoking ~ age + (1 | Individual), family = binomial)

Alternatively, a colleague routinely aggregates the data for each individual, thus producing a dataset of 100 rows.  He then models the effect of age with code such as:

# cbind() takes (successes, failures), so the second column is the non-smoking count
smoking.binomial <- glm(cbind(smoking.obs, total.obs - smoking.obs) ~ age, family = binomial)

I find this approach to be less intuitive, and I note that we get very different results when switching from one to the other.  I lack the statistical expertise to articulate the difference in the estimation of these models, and I would appreciate references that detail the consequences of using the different approaches.  Specifically, to what extent does the aggregation within individuals obviate the need (if at all) for an individual-level random effect?
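
To make the comparison concrete, here is a minimal sketch of the two data layouts described above, using simulated data. Everything in it is an assumption made for illustration: the data frame names (long, agg), the column names (smoking.obs, total.obs), the age range, and the effect sizes are all hypothetical, and none of it is real data. It uses only base R, so both fits are plain glm() calls without the individual-level random effect.

```r
set.seed(1)

n.id  <- 100   # individuals, as in the setup above
n.occ <- 50    # occasions per individual

# Hypothetical ages and effect sizes, chosen only for illustration
age <- round(runif(n.id, 18, 70))
u   <- rnorm(n.id, sd = 1)                  # individual-level deviation
p   <- plogis(-1 + 0.02 * (age - 40) + u)   # per-individual smoking probability

# Unaggregated layout: one row per observation (5000 rows)
long <- data.frame(
  Individual = factor(rep(seq_len(n.id), each = n.occ)),
  age        = rep(age, each = n.occ)
)
long$smoking <- rbinom(nrow(long), size = 1, prob = rep(p, each = n.occ))

# Aggregated layout: one row per individual (100 rows)
agg <- data.frame(
  Individual  = factor(seq_len(n.id)),
  age         = age,
  smoking.obs = as.vector(tapply(long$smoking, long$Individual, sum)),
  total.obs   = n.occ
)

# The same model, without a random effect, fitted to each layout
fit.long <- glm(smoking ~ age, family = binomial, data = long)
fit.agg  <- glm(cbind(smoking.obs, total.obs - smoking.obs) ~ age,
                family = binomial, data = agg)

# Compare the coefficients from the two layouts
rbind(long = coef(fit.long), aggregated = coef(fit.agg))
```

Because age is constant within an individual, the two glm() fits in this sketch should return the same coefficients up to numerical tolerance, which suggests that aggregation by itself is not what changes the estimates. The remaining difference between the two approaches is the (1 | Individual) term, which could be added by fitting glmer() to long, assuming lme4 is installed.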