[R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation
Rune Haubo
rhbc at imm.dtu.dk
Thu Sep 22 19:05:02 CEST 2011
On 21 September 2011 23:56, Jeremy Koster <helixed2 at yahoo.com> wrote:
> Thanks to Ben, Thomas, and David for the responses. As usual, the explanations didn't really sink in until I experimented with different formulations of models.
>
> Somewhat surprisingly, I discovered that lme4 can handle the following syntax for a dataset in which observations have been aggregated into rows by individuals (i.e., n = 100 rows):
>
> smoking.aggregated <- glmer(cbind(smoking_obs, total_obs - smoking_obs) ~ AGE + (1|Individual), family = binomial, data = aggregated)
>
> which produces the same estimates of the intercept, AGE, and Individual-level variance as the following code, which refers to an unaggregated dataset of 5000 rows (100 individuals with 50 observations each), with a binary variable for smoking (or not):
>
> smoking.unaggregated <- glmer(smoking ~ AGE + (1|Individual), family = binomial, data = unaggregated)
>
> Where the models differ is the value for the log-likelihood (and AIC). Considering that the estimates for covariates and random effects were identical in both models, my first guess was that lme4 was basically treating the aggregated data like an unaggregated dataset.
>
> Why, then, does the log-likelihood differ (substantially)? What are the implications for model selection, using AIC, for example? Is it possible that one might choose different models depending on whether the data have been aggregated or not?
As long as you stay with the same aggregation of the data, there are no
implications for model selection using AIC, as Ben explained. The
really interesting question (to me at least) is how to deal with BIC,
where the penalty term depends on the number of observations. BIC can
actually choose different models depending on whether you aggregate
the data or not. That raises the question of whether to count the
number of observations as the number of binomial observations (rows in
X), as most people seem to do, or as the number of Bernoulli trials
(sum of case weights). Since the data carry the same information
whether they are aggregated or not, it is somewhat unfortunate that
BIC can depend on it.
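For what it's worth, the aggregated and unaggregated log-likelihoods
differ only by the sum of the log binomial coefficients, which is a
constant for a given aggregation. A small illustration with made-up
numbers:

set.seed(1)
m <- 50                                   # occasions per individual
p <- plogis(rnorm(100, mean = -1))        # made-up fitted probabilities
y <- rbinom(100, size = m, prob = p)      # counts of smoking occasions
ll.agg   <- sum(dbinom(y, size = m, prob = p, log = TRUE))  # binomial (aggregated)
ll.unagg <- ll.agg - sum(lchoose(m, y))                     # Bernoulli (unaggregated)
## The offset sum(lchoose(m, y)) does not involve the parameters, so it
## cancels when comparing AICs of models fitted to the same aggregation;
## BIC additionally needs n, and log(100) vs log(5000) give different penalties.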
Cheers
Rune
>
>
> --- On Tue, 9/20/11, Thomas Levine <tkl22 at cornell.edu> wrote:
>
> From: Thomas Levine <tkl22 at cornell.edu>
> Subject: Re: [R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation
> To: "David Winsemius" <dwinsemius at comcast.net>
> Cc: "Jeremy Koster" <helixed2 at yahoo.com>, r-sig-mixed-models at r-project.org
> Date: Tuesday, September 20, 2011, 12:48 AM
>
> I believe that your approach is appropriate and that your colleague's approach is strange for the reason that David gave.
> The non-hierarchical models run much faster, and I suspect that this is why your colleague is using them. Even if you have the computing power to run the hierarchical models, your colleague may have learned methods that were used before hierarchical modeling was feasible.
>
>
> A simulation may help you understand this better. Here's a start at that.
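> Something along these lines (effect sizes and object names below are
> arbitrary placeholders):
>
> library(lme4)
> set.seed(42)
> n.id <- 100; n.occ <- 50
> unaggregated <- data.frame(
>   Individual = factor(rep(1:n.id, each = n.occ)),
>   AGE        = rep(round(runif(n.id, 18, 80)), each = n.occ))
> re  <- rep(rnorm(n.id, sd = 1), each = n.occ)   # individual random intercepts
> eta <- -4 + 0.03 * unaggregated$AGE + re
> unaggregated$smoking <- rbinom(n.id * n.occ, size = 1, prob = plogis(eta))
>
> ## aggregate to one row per individual
> aggregated <- aggregate(smoking ~ Individual + AGE, data = unaggregated, sum)
> names(aggregated)[names(aggregated) == "smoking"] <- "smoking_obs"
> aggregated$total_obs <- n.occ
>
> fit.unagg <- glmer(smoking ~ AGE + (1|Individual),
>                    family = binomial, data = unaggregated)
> fit.agg   <- glmer(cbind(smoking_obs, total_obs - smoking_obs) ~ AGE + (1|Individual),
>                    family = binomial, data = aggregated)
> ## near-identical estimates; compare logLik() and AIC() of the two fits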
> Tom
>
>
> On Mon, Sep 19, 2011 at 10:55 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Sep 19, 2011, at 6:44 PM, Jeremy Koster wrote:
>
> Imagine that I have observed 100 people on 50 separate occasions. For each observation, I record whether they are smoking or not. I am interested in modeling the effect of age on the likelihood of smoking.
>
> I could envision two ways of doing this. The first leaves the data in an unaggregated format -- that is, a dataset with 5000 rows -- and specifies a model with a random effect for individual, such as:
>
> smoking.logistic <- glmer(smoking ~ age + (1|Individual), family = binomial)
>
> Alternatively, a colleague routinely aggregates data for each individual, thus producing a dataset of 100 rows. He then models the effect of age by writing code:
>
> smoking.binomial <- glm(cbind(smoking_obs, total_obs - smoking_obs) ~ age, family = binomial)
>
> Wouldn't this model ignore the lack of independence in the observations and unfairly inflate the confidence in the estimate, since that is an ordinary grouped-data input for logistic regression? I would have guessed that it would be more appropriately constructed as:
>
> ## one row per individual: smoking_obs = count of smoking occasions, total_obs = occasions observed
> smk.pois <- glm(smoking_obs ~ age + offset(log(total_obs)), family = poisson)
>
> That way you have one observation per individual and the response is on the correct discrete scale.
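> To see how different the two scales can be (toy numbers):
>
> p <- 0.30           # chance of smoking on a given occasion
> log(p / (1 - p))    # log-odds: the scale of the binomial/logistic fit
> log(p)              # log-rate per occasion: the scale of the Poisson fit with offset
> ## The two scales only agree when p is small, so the binomial and
> ## Poisson formulations need not give similar coefficients.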
>
> I find this approach to be less intuitive, and I note that we get very different results when switching from one to the other. I lack the statistical expertise to articulate the difference in the estimation of these models, and I would appreciate references that detail the consequences of using the different approaches. Specifically, to what extent does the aggregation within individuals obviate the need (if at all) for an individual-level random effect?
>
> I do not think that the second one does actually aggregate within individuals, since you were using the binomial family.
>
> David Winsemius, MD
>
> West Hartford, CT
>
--
Rune Haubo Bojesen Christensen
Ph.D. Student, M.Sc. Eng.
Phone: (+45) 45 25 33 63
Mobile: (+45) 30 26 45 54
DTU Informatics, Section for Statistics
Technical University of Denmark, Build. 305, Room 122,
DK-2800 Kgs. Lyngby, Denmark