[R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation
Ben Bolker
bbolker at gmail.com
Tue Sep 20 14:05:31 CEST 2011
Jeremy Koster <helixed2 at ...> writes:
>
> Imagine that I have observed 100 people on 50 separate occasions.
> For each observation, I record whether
> they are smoking or not. I am interested in modeling the effect
> of age on the likelihood of smoking.
>
> I could envision two ways of doing this, leaving the
> data in an unaggregated format -- that is, a dataset with
> 5000 rows. Then specify a model with a random effect for
> individual, such as:
>
> smoking.logistic <- glmer (smoking ~ age + (1|Individual), family = binomial)
>
> Alternatively, a colleague routinely aggregates data for each
> individual, thus producing a dataset of
> 100 rows. He then models the effect of age by writing code:
>
> smoking.binomial <- glm (cbind(smoking observations, total observations) ~
>   age, family = binomial)
>
> I find this approach to be less intuitive, and I note that we
> get very different results when switching from
> one to the other. I lack the statistical expertise to
> articulate the difference in the estimation of these
> models, and I would appreciate references that detail the
> consequences of using the different
> approaches. Specifically, to what extent does the
> aggregation within individuals obviate the need (if
> at all) for an individual-level random effect?
It doesn't.
Consider several individuals of the *same* age: if they
all had exactly identical probabilities of smoking (the response),
then you could aggregate all of the individual Bernoulli variables
into a single binomial variable, changing only the normalization
constant. Adding among-individual variation changes the marginal
distribution from binomial to logit-normal-binomial, also called
logistic-normal-binomial (both terms are used). See e.g. Browne et al.
(2005), who use this version of what they call 'additive
overdispersion' in a logistic model.
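
As a rough illustration of the aggregation argument, here is a minimal
simulation sketch in base R. The parameter values, the helper names
(sim_smoking, compare_fits) and the column names are made up for the
example and are not taken from the original data. It fits the plain
binomial GLM to the 5000-row and to the aggregated 100-row version of
the same data, once with identical probabilities within an age and once
with an among-individual SD of 1 on the logit scale.

```r
set.seed(101)
n_ind <- 100   # individuals
n_obs <- 50    # occasions per individual

## Simulate one data set; sd_ind is the among-individual SD on the
## logit scale (all parameter values are hypothetical).
sim_smoking <- function(sd_ind) {
  age <- round(runif(n_ind, 18, 70))
  eta <- -1 + 0.03 * (age - 40) + rnorm(n_ind, sd = sd_ind)
  long <- data.frame(
    Individual = factor(rep(seq_len(n_ind), each = n_obs)),
    age        = rep(age, each = n_obs)
  )
  long$smoking <- rbinom(nrow(long), size = 1,
                         prob = plogis(rep(eta, each = n_obs)))
  long
}

## Fit the same fixed-effects-only GLM to unaggregated and aggregated data.
compare_fits <- function(long) {
  ## One row per individual: number of smoking observations out of n_obs
  agg <- aggregate(smoking ~ Individual + age, data = long, FUN = sum)
  agg$total <- n_obs
  fit_long <- glm(smoking ~ age, family = binomial, data = long)
  ## A two-column response is (successes, failures), not (successes, total)
  fit_agg <- glm(cbind(smoking, total - smoking) ~ age,
                 family = binomial, data = agg)
  c(coef_diff  = max(abs(coef(fit_long) - coef(fit_agg))),
    ## log-likelihood difference minus the binomial normalization constant
    loglik_gap = as.numeric(logLik(fit_agg) - logLik(fit_long)) -
                 sum(lchoose(agg$total, agg$smoking)),
    ## Pearson dispersion of the aggregated fit (about 1 if truly binomial)
    dispersion = sum(residuals(fit_agg, type = "pearson")^2) /
                 df.residual(fit_agg))
}

res_same <- compare_fits(sim_smoking(sd_ind = 0))
res_vary <- compare_fits(sim_smoking(sd_ind = 1))
print(rbind(identical_p = res_same, varying_p = res_vary))
```

In both scenarios the two GLMs should agree on the coefficients, and
their log-likelihoods should differ only by the sum of the log binomial
coefficients, so aggregation by itself changes nothing about the
fixed-effects-only fit. What the among-individual variation changes is
that the aggregated counts become overdispersed relative to the
binomial, which the plain GLM on either data layout ignores.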
A significance test of the random effect (i.e. of the among-individual
variance) is a test of the null hypothesis that all individuals with
the same age have exactly the same probability of smoking ...
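
One way such a test might be carried out is a likelihood ratio
comparison of the models with and without the individual-level random
effect. The sketch below assumes the lme4 package is installed, uses
simulated data with made-up parameter values, and rescales age (age_s,
a hypothetical name) only to help the optimizer. It also assumes that
the glm and glmer log-likelihoods are on the same scale for 0/1 data.
Because the null value of the variance (zero) lies on the boundary of
the parameter space, the naive chi-square p-value is conservative;
halving it is a common rule of thumb, not an exact result.

```r
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(102)
  n_ind <- 100   # individuals
  n_obs <- 50    # occasions per individual

  ## Hypothetical truth: logit-scale among-individual SD of 1
  age <- round(runif(n_ind, 18, 70))
  eta <- -1 + 0.03 * (age - 40) + rnorm(n_ind, sd = 1)
  dat <- data.frame(
    Individual = factor(rep(seq_len(n_ind), each = n_obs)),
    age_s      = rep((age - 40) / 10, each = n_obs)  # centred, in decades
  )
  dat$smoking <- rbinom(nrow(dat), size = 1,
                        prob = plogis(rep(eta, each = n_obs)))

  ## Null model: everyone of the same age has the same probability
  fit0 <- glm(smoking ~ age_s, family = binomial, data = dat)
  ## Alternative: individual-level random intercept
  fit1 <- lme4::glmer(smoking ~ age_s + (1 | Individual),
                      family = binomial, data = dat)

  ## Likelihood ratio statistic, computed by hand
  lrt     <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
  p_naive <- pchisq(lrt, df = 1, lower.tail = FALSE)
  p_half  <- 0.5 * p_naive   # rule-of-thumb boundary correction

  print(c(LRT = lrt, p_naive = p_naive, p_half = p_half))
}
```

Both models are fitted to the unaggregated data here so that the two
log-likelihoods refer to the same response.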
Browne, W. J., S. V. Subramanian, K. Jones, and
H. Goldstein. 2005. “Variance partitioning in multilevel logistic
models that exhibit overdispersion.” Journal of the Royal Statistical
Society: Series A (Statistics in Society) 168 (3) (July 1):
599-613. http://dx.doi.org/10.1111/j.1467-985X.2004.00365.x