[R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation

Wed Sep 21 23:56:26 CEST 2011

Thanks to Ben, Thomas, and David for the responses.  As usual, the explanations didn't really sink in until I experimented with different formulations of models.

Somewhat surprisingly, I discovered that lme4 can handle the following syntax for a dataset in which observations have been aggregated into rows by individuals (i.e., n = 100 rows):

smoking.aggregated <- glmer (cbind(smoking observations, total observations) ~ AGE + (1|Individual), family = binomial, data = aggregated)

which produces the same estimates of the intercept, AGE, and Individual-level variance as the following code, which refers to an unaggregated dataset of 5000 rows (100 individuals with 50 observations each), with a binary variable for smoking (or not):

smoking.unaggregated <- glmer (smoking ~ AGE + (1|Individual), family = binomial, data = unaggregated)

Where the models differ is the value for the log-likelihood (and AIC).  Considering that the estimates for covariates and random effects were identical in both models, my first guess was that lme4 was basically treating the aggregated data like an unaggregated dataset.

Why, then, does the log-likelihood differ (substantially)?  What are the implications for model selection, using AIC, for example?  Is it possible that one might choose different models depending on whether the data have been aggregated or not?

--- On Tue, 9/20/11, Thomas Levine <tkl22 at cornell.edu> wrote:

From: Thomas Levine <tkl22 at cornell.edu>
Subject: Re: [R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation
To: "David Winsemius" <dwinsemius at comcast.net>
Cc: "Jeremy Koster" <helixed2 at yahoo.com>, r-sig-mixed-models at r-project.org
Date: Tuesday, September 20, 2011, 12:48 AM

I believe that your approach is appropriate and that your colleague's approach is strange for the reason that David gave.
The non-hierarchical models run much faster, and I suspect that this is why your colleague is using them. Even if you have the power to run the hierarchical models, your colleague may have learned methods that were used before hierarchical modeling was feasible.

A simulation may help you understand this better. Here's a start at that.
Tom

On Mon, Sep 19, 2011 at 10:55 PM, David Winsemius <dwinsemius at comcast.net> wrote:

On Sep 19, 2011, at 6:44 PM, Jeremy Koster wrote:

Imagine that I have observed 100 people on 50 separate occasions.  For each observation, I record whether they are smoking or not.  I am interested in modeling the effect of age on the likelihood of smoking.

I could envision two ways of doing this, leaving the data in an unaggregated format -- that is, a dataset with 5000 rows.  Then specify a model with a random effect for individual, such as:

smoking.logistic <- glmer (smoking ~ age + (1|Individual), family = binomial)

Alternatively, a colleague routinely aggregates data for each individual, thus producing a dataset of 100 rows.  He then models the effect of age by writing code:

smoking.binomial <- glm (cbind(smoking observations, total observations) ~ age, family = binomial)

Wouldn't this model ignore the lack of indepedence in the observations and unfairly inflate the confidence in the estimate, since that is an ordinary grouped data input for logistic regression? I would have guessed that it would be more appropriately constructed as:

smk.pois <- glm((sum(smoking_obs) ~ age +offset(log(sum(total_obs))), family =poisson)

That way you have one observation per individual and the response is on the correct discrete scale.

I find this approach to be less intuitive, and I note that we get very different results when switching from one to the other.  I lack the statistical expertise to articulate the difference in the estimation of these models, and I would appreciate references that detail the consequences of using the different approaches.  Specifically, to what extent does the aggregation within individuals obviate the need (if at all) for an individual-level random effect?

I do not think that the second one does actually aggregate within indivduals, since you were using the binomial family.

_______________________________________________

R-sig-mixed-models at r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

David Winsemius, MD

West Hartford, CT

_______________________________________________

R-sig-mixed-models at r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models