[R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation

Tue Sep 20 14:05:31 CEST 2011

Jeremy Koster <helixed2 at ...> writes:

> 
> Imagine that I have observed 100 people on 50 separate occasions.  
> For each observation, I record whether
> they are smoking or not.  I am interested in modeling the effect 
> of age on the likelihood of smoking.
> 
> I could envision two ways of doing this, leaving the 
> data in an unaggregated format -- that is, a dataset with
> 5000 rows.  Then specify a model with a random effect for 
> individual, such as:
> 
> smoking.logistic <- glmer (smoking ~ age + (1|Individual), family = binomial)
> 
> Alternatively, a colleague routinely aggregates data for each 
> individual, thus producing a dataset of
> 100 rows.  He then models the effect of age by writing code:
> 
> smoking.binomial <- glm (cbind(smoking observations, total observations) ~ 
>       age, family = binomial)
> 
> I find this approach to be less intuitive, and I note that we 
> get very different results when switching from
> one to the other.  I lack the statistical expertise to 
> articulate the difference in the estimation of these
> models, and I would appreciate references that detail the 
> consequences of using the different
> approaches.  Specifically, to what extent does the
> aggregation within individuals obviate the need (if
> at all) for an individual-level random effect?

  It doesn't.

  Consider several individuals of the *same* age: if they
all had exactly identical probabilities of smoking (the response),
then you could aggregate all of the individual Bernoulli variables
into a single binomial variable, changing only the normalization
constant of the likelihood.  Adding among-individual variation
changes the marginal distribution from binomial to
logit-normal-binomial or logistic-normal-binomial (both terms are
used).  See e.g. Browne et al. (2005), who use this version of what
they call 'additive overdispersion' in a logistic model.
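
  To spell out the "normalization constant" point, here is a short
derivation.  The notation is my own and not taken from Browne et al.:
N Bernoulli observations y_1, ..., y_N, their sum s, a common
probability p, and an individual-level deviation u with variance
sigma^2 on the logit scale.

```latex
% Pooling N Bernoulli observations that share one probability p:
\prod_{j=1}^{N} p^{y_j} (1-p)^{1-y_j} = p^{s} (1-p)^{N-s},
\qquad s = \sum_{j=1}^{N} y_j .

% The binomial likelihood for the aggregated count s differs only by a
% factor that does not involve p, so the estimate of p is unchanged:
\Pr(s \mid N, p) = \binom{N}{s} \, p^{s} (1-p)^{N-s} .

% With among-individual variation, individual i has its own probability
% \operatorname{logit}(p_i) = \beta_0 + \beta_1 \, \mathrm{age}_i + u_i,
% u_i \sim \mathcal{N}(0, \sigma^2), and the marginal distribution of s_i
% is a logistic-normal-binomial mixture rather than a binomial:
\Pr(s_i \mid N)
  = \int_{-\infty}^{\infty}
    \binom{N}{s_i} \, p_i(u)^{s_i} \bigl(1 - p_i(u)\bigr)^{N - s_i}
    \, \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-u^2 / (2\sigma^2)} \, du .
```

  When sigma^2 = 0 the integral collapses back to the plain binomial,
which is the case the aggregated glm() implicitly assumes.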

  Testing the significance of the random effect amounts to testing
the null hypothesis that all individuals with the same age have
exactly the same probability of smoking ...
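
  As a rough illustration, here is a simulation sketch with lme4.
All parameter values, the seed, and the names long, agg, age_c and
not_smoking are made up for the example; only the 100 x 50 design
comes from the question.  It fits the unaggregated and aggregated
mixed models, which should agree closely, and then compares the
aggregated model with and without the individual-level random effect.

```r
library(lme4)

set.seed(101)
n_ind <- 100   # individuals
n_obs <- 50    # observation occasions per individual

# Hypothetical values, chosen only for illustration
age   <- round(runif(n_ind, 18, 70))
age_c <- (age - 40) / 10                  # centred, in decades
u     <- rnorm(n_ind, mean = 0, sd = 1)   # among-individual variation (logit scale)
p     <- plogis(-0.5 - 0.3 * age_c + u)   # each individual's smoking probability

# Unaggregated data: one Bernoulli row per observation (5000 rows)
long <- data.frame(
  Individual = factor(rep(seq_len(n_ind), each = n_obs)),
  age_c      = rep(age_c, each = n_obs),
  smoking    = rbinom(n_ind * n_obs, size = 1, prob = rep(p, each = n_obs))
)

# Aggregated data: one binomial row per individual (100 rows)
agg <- aggregate(smoking ~ Individual + age_c, data = long, FUN = sum)
agg$not_smoking <- n_obs - agg$smoking

# Same model, two data layouts
m_long <- glmer(smoking ~ age_c + (1 | Individual),
                family = binomial, data = long)
# Note: a two-column binomial response is cbind(successes, failures)
m_agg  <- glmer(cbind(smoking, not_smoking) ~ age_c + (1 | Individual),
                family = binomial, data = agg)

# Aggregated model WITHOUT the random effect (the colleague's approach)
m_glm  <- glm(cbind(smoking, not_smoking) ~ age_c,
              family = binomial, data = agg)

# Fixed effects from the two glmer fits, side by side
print(cbind(long = fixef(m_long), aggregated = fixef(m_agg)))

# Likelihood-ratio statistic for the random effect; this assumes the
# glm and glmer log-likelihoods include the same binomial constants
lrt <- as.numeric(2 * (logLik(m_agg) - logLik(m_glm)))
# Naive chi-square p-value; the null (variance = 0) is on the boundary
# of the parameter space, so this is conservative
p_naive <- pchisq(lrt, df = 1, lower.tail = FALSE)
```

  Under these assumptions the aggregation itself is harmless; what
changes the answer is dropping the (1 | Individual) term.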

Browne, W. J., S. V. Subramanian, K. Jones, and
H. Goldstein. 2005. “Variance partitioning in multilevel logistic
models that exhibit overdispersion.” Journal of the Royal Statistical
Society: Series A (Statistics in Society) 168 (3):
599-613. http://dx.doi.org/10.1111/j.1467-985X.2004.00365.x