[R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation

Douglas Bates bates at stat.wisc.edu
Thu Sep 22 21:36:47 CEST 2011


On Thu, Sep 22, 2011 at 12:05 PM, Rune Haubo <rhbc at imm.dtu.dk> wrote:
> On 21 September 2011 23:56, Jeremy Koster <helixed2 at yahoo.com> wrote:
>> Thanks to Ben, Thomas, and David for the responses.  As usual, the explanations didn't really sink in until I experimented with different formulations of models.
>>
>> Somewhat surprisingly, I discovered that lme4 can handle the following syntax for a dataset in which observations have been aggregated into rows by individuals (i.e., n = 100 rows):
>>
>> smoking.aggregated <- glmer (cbind(smoking_obs, total_obs - smoking_obs) ~ AGE + (1|Individual), family = binomial, data = aggregated)
>>
>> which produces the same estimates of the intercept, AGE, and Individual-level variance as the following code, which refers to an unaggregated dataset of 5000 rows (100 individuals with 50 observations each), with a binary variable for smoking (or not):
>>
>> smoking.unaggregated <- glmer (smoking ~ AGE + (1|Individual), family = binomial, data = unaggregated)
>>
>> Where the models differ is the value for the log-likelihood (and AIC).  Considering that the estimates for covariates and random effects were identical in both models, my first guess was that lme4 was basically treating the aggregated data like an unaggregated dataset.
>>
>> Why, then, does the log-likelihood differ (substantially)?  What are the implications for model selection, using AIC, for example?  Is it possible that one might choose different models depending on whether the data have been aggregated or not?

This difference is the reason for several peculiarities in R's glm
families.  If you check closely, you'll see that the binomial family defines a
numeric vector called 'n', which is the number of trials making up the
binomial random variable at each observation.  At one time it was only the
binomial family that defined this, even though it was an argument to
the AIC member function for each of the families.  The other families
could get away with not defining it because it wasn't used in the AIC
member function and lazy evaluation allows you to leave an unused
argument undefined.  (Eventually Martin Maechler objected to this
sloppy programming practice and added a definition of 'n' as,
effectively, rep(1, length(y)) to all the other families.)
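
You can see where 'n' comes from by printing each family's initialize
expression -- a quick inspection in a recent R session, no model is
fitted:

binomial()$initialize  ## sets n to 1 per observation for a 0/1 (or factor)
                       ## response, and to the row totals for a two-column
                       ## matrix response
poisson()$initialize   ## should show n being set to rep.int(1, nobs)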

Why is 'n' an argument to the AIC member function but not to the
deviance residuals member function, dev.resids?  Well, the deviance
residuals whose sum is the "deviance" don't represent the deviance in
this case.  (Some people would claim that they do if you simply define
the deviance as the difference between the current model and the
saturated model.)  So for every other model (except the linear model,
where the deviance is defined as the residual sum of squares) the AIC
is the deviance + 2 * p (I think that's right, correct me if I am
wrong), where p is the number of parameters in the model.  For a
binomial glm the sum of the deviance residuals isn't really the
deviance, and the value of the aic member function isn't really the
AIC.  Of course, the value of the aic member function isn't the AIC
for any of the glm families; it's the deviance, in the sense of minus
twice the log-likelihood.

> binomial()$aic
function (y, n, mu, wt, dev)
{
    m <- if (any(n > 1))
        n
    else wt
    -2 * sum(ifelse(m > 0, (wt/m), 0) * dbinom(round(m * y),
        round(m), mu, log = TRUE))
}
> poisson()$aic
function (y, n, mu, wt, dev)
-2 * sum(dpois(y, mu, log = TRUE) * wt)

If this is all starting to smell like a hack to you, that's because it
is one.  (Notice that you could really shoot yourself in the foot if
you happened to use weights for a binomial glm that do not give the
expected integer values for m * y.  And notice that neither of those
aic functions ever uses the dev argument.  That argument is used only
in the gaussian() family where, as mentioned above, the deviance
residuals are different from those in all the other families.)
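
To see what the binomial aic member actually returns, you can call it
directly with arguments of the form glm.fit supplies for an aggregated
fit (inside glm.fit the response has been turned into a proportion and
the weights into the trial counts; the numbers below are made up):

yi <- c(3, 7, 5)            # made-up success counts
ni <- c(10, 10, 10)         # made-up numbers of trials
mu <- c(0.3, 0.6, 0.5)      # made-up fitted probabilities
binomial()$aic(y = yi/ni, n = ni, mu = mu, wt = ni, dev = NA)
## the same as minus twice the binomial log-likelihood,
## choose(ni, yi) terms included:
-2 * sum(dbinom(yi, ni, mu, log = TRUE))

The 2 * p that turns this into an AIC is added afterwards in glm.fit,
not by the family.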

I imagine that if we used the deviance as defined in the aic member
function (which returns the deviance, not the AIC) we would get the
same answers for the log-likelihood from the aggregated and
non-aggregated data.
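
Here is a minimal sketch of where the constant offset comes from, using
plain glm fits on simulated data (made-up names and numbers, and no
random effect, so the point is just the likelihood bookkeeping): the
coefficient estimates agree, and the log-likelihoods differ by the sum
of the log binomial coefficients, which does not involve the parameters.

set.seed(1)
n.ind <- 100                      # individuals
n.obs <- 50                       # observations per individual
age   <- runif(n.ind, 20, 60)
p     <- plogis(-4 + 0.05 * age)  # true smoking probabilities
y     <- rbinom(n.ind, size = n.obs, prob = p)   # aggregated counts

agg   <- data.frame(age = age, smoke = y, total = n.obs)
unagg <- data.frame(age   = rep(age, each = n.obs),
                    smoke = unlist(lapply(y, function(k)
                                   rep(c(1, 0), c(k, n.obs - k)))))

fit.agg   <- glm(cbind(smoke, total - smoke) ~ age, family = binomial,
                 data = agg)
fit.unagg <- glm(smoke ~ age, family = binomial, data = unagg)

all.equal(coef(fit.agg), coef(fit.unagg))        # essentially identical
as.numeric(logLik(fit.agg)) - as.numeric(logLik(fit.unagg))
sum(lchoose(n.obs, y))                           # the same constant

Because that offset is the same for every model fitted to a given
representation of the data, AIC comparisons within one representation
are unaffected; only comparisons across representations are thrown off.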


> As long as you stay with the same aggregation of the data there are no
> implications for model selection using AIC, as Ben explained. The
> really interesting question (to me at least) is how to deal with BIC,
> where the penalty term depends on the number of observations. BIC can
> actually choose different models depending on whether you aggregate
> the data or not. That brings up the question of whether to count the
> number of observations as the number of binomial observations (rows in
> X), as most people seem to do, or as the number of Bernoulli
> trials (sum of case weights). Since the data carry the same information
> whether they are aggregated or not, it is somewhat unfortunate that
> BIC can depend on it.
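
To make that concrete with the sketch above (reusing fit.agg and
fit.unagg, which describe exactly the same data): the number of
observations that feeds the BIC penalty is 100 rows for one fit and
5000 Bernoulli rows for the other, so the per-parameter penalty log(n)
differs even though the log-likelihood difference between any two
candidate models does not.

c(aggregated = nobs(fit.agg), unaggregated = nobs(fit.unagg))  # 100 vs 5000
log(c(100, 5000))     # per-parameter BIC penalty, roughly 4.6 vs 8.5
BIC(fit.agg)
BIC(fit.unagg)
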
>
> Cheers
> Rune
>
>>
>> --- On Tue, 9/20/11, Thomas Levine <tkl22 at cornell.edu> wrote:
>>
>> From: Thomas Levine <tkl22 at cornell.edu>
>> Subject: Re: [R-sig-ME] Binomial vs. logistic regression & the consequences of aggregation
>> To: "David Winsemius" <dwinsemius at comcast.net>
>> Cc: "Jeremy Koster" <helixed2 at yahoo.com>, r-sig-mixed-models at r-project.org
>> Date: Tuesday, September 20, 2011, 12:48 AM
>>
>> I believe that your approach is appropriate and that your colleague's approach is strange for the reason that David gave.
>> The non-hierarchical models run much faster, and I suspect that this is why your colleague is using them. Even if you have the power to run the hierarchical models, your colleague may have learned methods that were used before hierarchical modeling was feasible.
>>
>> A simulation may help you understand this better. Here's a start at that.
>> Tom
>>
>> On Mon, Sep 19, 2011 at 10:55 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Sep 19, 2011, at 6:44 PM, Jeremy Koster wrote:
>>
>> Imagine that I have observed 100 people on 50 separate occasions.  For each observation, I record whether they are smoking or not.  I am interested in modeling the effect of age on the likelihood of smoking.
>>
>> I could envision two ways of doing this.  One is to leave the data in an unaggregated format -- that is, a dataset with 5000 rows -- and then specify a model with a random effect for individual, such as:
>>
>> smoking.logistic <- glmer (smoking ~ age + (1|Individual), family = binomial)
>>
>> Alternatively, a colleague routinely aggregates data for each individual, thus producing a dataset of 100 rows.  He then models the effect of age by writing code:
>>
>> smoking.binomial <- glm (cbind(smoking_obs, total_obs - smoking_obs) ~ age, family = binomial)
>>
>> Wouldn't this model ignore the lack of independence in the observations and unfairly inflate the confidence in the estimate, since that is an ordinary grouped-data input for logistic regression? I would have guessed that it would be more appropriately constructed as:
>>
>> smk.pois <- glm(sum(smoking_obs) ~ age + offset(log(sum(total_obs))), family = poisson)
>>
>> That way you have one observation per individual and the response is on the correct discrete scale.
>>
>> I find this approach to be less intuitive, and I note that we get very different results when switching from one to the other.  I lack the statistical expertise to articulate the difference in the estimation of these models, and I would appreciate references that detail the consequences of using the different approaches.  Specifically, to what extent does the aggregation within individuals obviate the need (if at all) for an individual-level random effect?
>>
>> I do not think that the second one does actually aggregate within individuals, since you were using the binomial family.
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>
> --
> Rune Haubo Bojesen Christensen
>
> Ph.D. Student, M.Sc. Eng.
> Phone: (+45) 45 25 33 63
> Mobile: (+45) 30 26 45 54
>
> DTU Informatics, Section for Statistics
> Technical University of Denmark, Build. 305, Room 122,
> DK-2800 Kgs. Lyngby, Denmark
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>




More information about the R-sig-mixed-models mailing list