I have a question that I think will interest many
on this list. It is not intrinsically related to
mixed models, although they may generate some
additional complications, as I discuss at the end.
Here's a toy example that illustrates the issue.
Suppose one has a data frame containing successes
associated with a binomial r.v., e.g.,
Sex  Factor  Weight
10   1       20      (e.g., 10 boys out of 20 children in study 1)
15   2       20
Call this the "AGGREGATED" form of the data. One
could proceed to analyze this as follows:
short
Call: glm(formula = Sex ~ 1, family = binomial, data = test2, weights = Total)
Coefficients:
(Intercept)
0.5108
Degrees of Freedom: 1 Total (i.e. Null); 1 Residual
Null Deviance: 2.706
Residual Deviance: 2.706 AIC: 11.37
One could then look for an effect of Factor:
short_factor
Call: glm(formula = Sex ~ Factor, family = binomial, data = test2, weights = Total)
Coefficients:
(Intercept) Factor2
-3.753e-16 1.099e+00
Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
Null Deviance: 2.706
Residual Deviance: 2.22e-15 AIC: 10.67
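For anyone who wants to reproduce the aggregated fits, here is a minimal
sketch. It assumes that the response passed to glm() is the proportion of
boys, with the number of children supplied as the prior weight, since
glm() does not accept raw counts as a single-column binomial response.
The helper column Boys is hypothetical, and the weight column is named
Total to follow the Call lines rather than the table header.

```r
# Aggregated data: one row per study.
# 'Boys' is a hypothetical helper column; 'Sex' is stored as a proportion
# (an assumption), because glm() with weights expects a proportion.
test2 <- data.frame(Boys   = c(10, 15),
                    Factor = factor(c(1, 2)),
                    Total  = c(20, 20))
test2$Sex <- test2$Boys / test2$Total

short        <- glm(Sex ~ 1,      family = binomial, data = test2, weights = Total)
short_factor <- glm(Sex ~ Factor, family = binomial, data = test2, weights = Total)

# Equivalent specification with a two-column (successes, failures) response.
short_cbind <- glm(cbind(Boys, Total - Boys) ~ 1, family = binomial, data = test2)

coef(short)               # intercept on the logit scale, log(25/15)
AIC(short, short_factor)  # compare with the AIC values printed above
```

The weighted-proportion form and the cbind() form describe the same
binomial likelihood, so either can be used for aggregated data.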
Of course, 10 boys out of 20 children means that
there were 20 Bernoulli trials, so the
"DISAGGREGATED" form of the data frame above is
Sex Factor
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
0 2
0 2
0 2
0 2
0 2
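The 40-row frame need not be typed by hand. As a sketch, assuming the
counts are held in hypothetical columns Boys and Total, the aggregated
rows can be expanded into one Bernoulli row per child like this:

```r
# Aggregated counts; 'Boys' and 'Total' are hypothetical column names.
agg <- data.frame(Boys = c(10, 15), Factor = factor(c(1, 2)), Total = c(20, 20))

# One row per child: Boys[i] ones followed by (Total[i] - Boys[i]) zeros.
test <- data.frame(
  Sex    = unlist(Map(function(b, n) rep(c(1, 0), times = c(b, n - b)),
                      agg$Boys, agg$Total)),
  Factor = rep(agg$Factor, times = agg$Total)
)

# Cross-tabulating the long form recovers the original counts.
table(test$Factor, test$Sex)
```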
Now, we have
long
Call: glm(formula = Sex ~ 1, family = binomial, data = test)
Coefficients:
(Intercept)
0.5108
Degrees of Freedom: 39 Total (i.e. Null); 39 Residual
Null Deviance: 52.93
Residual Deviance: 52.93 AIC: 54.93
And
long_factor
Call: glm(formula = Sex ~ Factor, family = binomial, data = test)
Coefficients:
(Intercept) Factor2
7.506e-16 1.099e+00
Degrees of Freedom: 39 Total (i.e. Null); 38 Residual
Null Deviance: 52.93
Residual Deviance: 50.22 AIC: 54.22
There is nothing surprising here: we get
identical estimates of the overall proportion of
boys (in the intercept-only model) and identical
estimates of the factor-specific intercepts (in
the factor model), ignoring numerical error. In
addition, the AIC values are such that one would
select the same model regardless of whether the
data were aggregated or not.
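One way to see why the AIC ranking agrees in this toy example: the
binomial log-likelihood of the aggregated data and the Bernoulli
log-likelihood of the disaggregated data differ only by the log binomial
coefficients, which do not involve the model parameters. The sketch
below checks this numerically for the glm fits. The column names Boys
and Total are assumptions, and the sketch covers only these fixed-effect
models.

```r
# Aggregated and disaggregated versions of the same toy data.
agg  <- data.frame(Boys = c(10, 15), Factor = factor(c(1, 2)), Total = c(20, 20))
long <- data.frame(Sex    = rep(c(1, 0, 1, 0), times = c(10, 10, 15, 5)),
                   Factor = factor(rep(c(1, 2), each = 20)))

a0 <- glm(cbind(Boys, Total - Boys) ~ 1,      family = binomial, data = agg)
a1 <- glm(cbind(Boys, Total - Boys) ~ Factor, family = binomial, data = agg)
d0 <- glm(Sex ~ 1,      family = binomial, data = long)
d1 <- glm(Sex ~ Factor, family = binomial, data = long)

# Constant shift in AIC: twice the summed log binomial coefficients.
aic_shift <- 2 * sum(lchoose(agg$Total, agg$Boys))
c(AIC(d0) - AIC(a0), AIC(d1) - AIC(a1), aic_shift)

# Differences between models are therefore the same in both forms.
c(AIC(a0) - AIC(a1), AIC(d0) - AIC(d1))
c(deviance(a0) - deviance(a1), deviance(d0) - deviance(d1))
```

The residual deviances and residual degrees of freedom themselves are
not comparable across the two forms, because the saturated model they
are measured against has 2 parameters in one case and 40 in the other.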
However, it is not obvious to me that one
generally gets the "same" results regardless of
the aggregation. Therefore, my question is: is
there a preferred or canonical input form from a
statistical point of view: aggregated,
disaggregated, or something else? I am interested
to hear about the statistical ins and outs of
this. I have often obtained disaggregated data
(as defined above) and not thought twice about
the consequences of aggregating them for analysis
(or vice versa).
Of course, I think the answer to my question will
be that "it depends"... on what one is trying to
estimate. To this extent, there may be extra
wrinkles to this issue in the context of
analyzing mixed models. Consider the following:
AGGREGATED
shortlmer
Generalized linear mixed model fit by the Laplace approximation
Formula: Sex ~ 1 + (1 | Factor)
Data: test2
AIC BIC logLik deviance
6.612 3.998 -1.306 2.612
Random effects:
Groups Name Variance Std.Dev.
Factor (Intercept) 0.074601 0.27313
Number of obs: 2, groups: Factor, 2
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5202 0.3803 1.368 0.171
shortlmer_factor
Generalized linear mixed model fit by the Laplace approximation
Formula: Sex ~ Factor + (1 | Factor)
Data: test2
AIC BIC logLik deviance
6 2.079 -1.524e-12 3.048e-12
Random effects:
Groups Name Variance Std.Dev.
Factor (Intercept) 5.6766e-17 7.5343e-09
Number of obs: 2, groups: Factor, 2
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.903e-07 4.472e-01 1.10e-06 1.000
Factor2 1.099e+00 6.831e-01 1.608 0.108
Correlation of Fixed Effects:
(Intr)
Factor2 -0.655
In this case, note that AIC of the constant model
is greater than the AIC of the factor model AND
the BIC of the constant model is greater than the
BIC of the factor model. So, the two criteria
tend in the same direction in terms of model
selection.
DISAGGREGATED
longlmer
Generalized linear mixed model fit by the Laplace approximation
Formula: Sex ~ 1 + (1 | Factor)
Data: test
AIC BIC logLik deviance
56.83 60.21 -26.42 52.83
Random effects:
Groups Name Variance Std.Dev.
Factor (Intercept) 0.074601 0.27313
Number of obs: 40, groups: Factor, 2
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5201 0.3803 1.368 0.171
longlmer_factor
Generalized linear mixed model fit by the Laplace approximation
Formula: Sex ~ Factor + (1 | Factor)
Data: test
AIC BIC logLik deviance
56.22 61.29 -25.11 50.22
Random effects:
Groups Name Variance Std.Dev.
Factor (Intercept) 0 0
Number of obs: 40, groups: Factor, 2
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.856e-07 4.472e-01 1.09e-06 1.000
Factor2 1.099e+00 6.831e-01 1.608 0.108
Correlation of Fixed Effects:
(Intr)
Factor2 -0.655
We get identical fixed-effect estimates
regardless of aggregation/disaggregation.
Note that the AIC of the constant model is greater
than the AIC of the factor model, but the BIC of
the constant model is LESS than the BIC of the
factor model. Of course, the differences are not
great (magnitude less than 2), but the two
criteria do NOT point in the same direction in
terms of model selection.
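The printed AIC and BIC values are consistent with AIC = deviance + 2k
and BIC = deviance + k*log(n), where n is the "Number of obs" in each
printout (2 rows versus 40 rows). The sketch below recomputes them from
the reported deviances. The parameter counts k (fixed effects plus one
variance) and the formulas themselves are my assumptions about what the
software did, although they do reproduce the numbers shown.

```r
# Deviances, parameter counts, and n taken from the four printouts above.
fits <- data.frame(
  form     = c("aggregated", "aggregated", "disaggregated", "disaggregated"),
  model    = c("constant", "factor", "constant", "factor"),
  deviance = c(2.612, 3.048e-12, 52.83, 50.22),
  k        = c(2, 3, 2, 3),    # fixed effects + one variance (assumed count)
  n        = c(2, 2, 40, 40)   # "Number of obs" reported by each fit
)
fits$AIC <- fits$deviance + 2 * fits$k
fits$BIC <- fits$deviance + fits$k * log(fits$n)
fits

# Cost of the extra parameter under each criterion.
c(AIC = 2, BIC_aggregated = log(2), BIC_disaggregated = log(40))
```

The drop in deviance from adding Factor is about 2.61 in both forms.
That exceeds the AIC cost of 2 and the aggregated BIC cost of log(2),
about 0.69, but not the disaggregated BIC cost of log(40), about 3.69,
which is where the reversal comes from.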
This is one context in which the data
aggregation/disaggregation could influence the
results of analysis even when the analysis is at
a level "above" the level of the aggregation.
Thoughts about all of this as well as pointers to
any relevant discussions in the literature would
be much appreciated.
--
Steven Orzack
The Fresh Pond Research Institute
173 Harvey Street
Cambridge, MA. 02140
617 864-4307
www.freshpond.org