[R-sig-ME] Sample size and mixed models

Sun Dec 14 04:37:22 CET 2008

Section 4.5.3 of Agresti's Categorical Data Analysis (pages 140-141 of the
second edition) discusses "grouped vs ungrouped" for the binomial case.
The same issue arises in Poisson models for count data.  All individuals
with the same "covariate pattern" can be collapsed to a single record, with
the sum of the counts as the response and the log of the sample size as an
offset (assuming the usual log link), and the same parameter estimates are
obtained.
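
As a minimal sketch of why the grouped and ungrouped forms give the same fit
in the binomial case (Python, with made-up counts; the function names are
illustrative and not from any package): for a single covariate pattern the
two log-likelihoods differ only by the log of a binomial coefficient, which
does not involve p.

```python
import math

def ungrouped_loglik(p, outcomes):
    """Bernoulli log-likelihood: one term per individual (0/1 outcome)."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in outcomes)

def grouped_loglik(p, successes, n):
    """Binomial log-likelihood for the same individuals collapsed to one record."""
    return (math.log(math.comb(n, successes))
            + successes * math.log(p)
            + (n - successes) * math.log(1.0 - p))

# Hypothetical covariate pattern: 10 individuals, 3 successes.
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
n, successes = len(outcomes), sum(outcomes)

# Difference between the two log-likelihoods at several values of p.
gaps = [grouped_loglik(p, successes, n) - ungrouped_loglik(p, outcomes)
        for p in (0.1, 0.3, 0.5, 0.9)]

# The difference is always log C(n, successes), whatever p is, so both
# versions are maximized at the same p.
constant = math.log(math.comb(n, successes))
for p, gap in zip((0.1, 0.3, 0.5, 0.9), gaps):
    print(f"p = {p}: grouped - ungrouped = {gap:.6f}")
```

The constant drops out of any comparison between models fitted to the same
data, but it does change the value of the log-likelihood itself, which matters
for measures that use that value directly.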

Agresti refers to two different versions of the "saturated" model, but I
like to reserve the term "saturated" for the model that fits the grouped
data perfectly and call the other the "perfect" model (since it predicts
all the individuals correctly).

Nagelkerke's R^2 will be larger when computed using the grouped data
likelihood, but that's because the "saturated" model is the definition
of perfection in that case.  This is analogous to defining the model
"y ~ factor(x)" as perfect when assessing "y ~ x" - you're throwing away
the "within groups" sum of squares and treating the "between groups" sum
of squares as the total.
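
The sum-of-squares analogy can be sketched numerically (Python, with made-up
data; the variable names are illustrative only).  The straight line is judged
once against the total sum of squares and once against the between-groups
sum of squares, which is what "y ~ factor(x)" explains.

```python
# Hypothetical data: three x values, three observations at each.
data = {1: [1.0, 2.0, 3.0], 2: [2.0, 4.0, 6.0], 3: [3.0, 4.0, 5.0]}

xs = [x for x, group in data.items() for _ in group]
ys = [y for group in data.values() for y in group]
n = len(ys)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares line for "y ~ x".
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Sums of squares: total = between groups + within groups.
ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_between = sum(len(g) * (sum(g) / len(g) - y_bar) ** 2 for g in data.values())
ss_within = ss_total - ss_between

# Sum of squares explained by the straight line.
ss_line = sum((intercept + slope * x - y_bar) ** 2 for x in xs)

r2_total = ss_line / ss_total      # judged against all the variation
r2_between = ss_line / ss_between  # judged with "y ~ factor(x)" as perfect

print(f"R^2 against total SS:   {r2_total:.2f}")
print(f"R^2 against between SS: {r2_between:.2f}")
```

With these numbers the line explains 6 of the 20 units of total sum of
squares (R^2 = 0.30) but 6 of the 8 units of between-groups sum of squares
(0.75); the 12 units of within-groups variation have simply been dropped
from the denominator.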

Strictly speaking, the choice of which version of n to use should probably not
be made independently of this issue.  If the count of individuals is used with the
grouped data likelihood, it reduces the amount by which the R^2 value is inflated,
which is my (admittedly weak) reason for the blanket recommendation to use the
larger n.
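
A rough numerical sketch of how the two choices interact (Python, made-up
data).  It assumes the usual form of Nagelkerke's R^2, i.e. the Cox-Snell
value divided by its maximum, and uses plain fixed-effects binomial fits
rather than a mixed model, so it only illustrates the direction of the effect
in one small example.

```python
import math

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's R^2: Cox-Snell R^2 divided by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(-2.0 * (loglik_model - loglik_null) / n)
    max_value = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_value

# Hypothetical data: 2 clusters of 10 individuals, with 3 and 7 successes.
clusters = [(3, 10), (7, 10)]
n_individuals = sum(m for _, m in clusters)
n_clusters = len(clusters)
p_null = sum(s for s, _ in clusters) / n_individuals  # intercept-only fit

def loglik(probs, grouped):
    """Binomial log-likelihood; grouped=True adds the log C(m, s) terms."""
    total = 0.0
    for (s, m), p in zip(clusters, probs):
        total += s * math.log(p) + (m - s) * math.log(1.0 - p)
        if grouped:
            total += math.log(math.comb(m, s))
    return total

null_probs = [p_null] * n_clusters
model_probs = [s / m for s, m in clusters]  # one probability per cluster

# R^2 for each combination of likelihood version and choice of n.
r2 = {
    ("ungrouped", "individuals"): nagelkerke_r2(
        loglik(null_probs, False), loglik(model_probs, False), n_individuals),
    ("grouped", "individuals"): nagelkerke_r2(
        loglik(null_probs, True), loglik(model_probs, True), n_individuals),
    ("grouped", "clusters"): nagelkerke_r2(
        loglik(null_probs, True), loglik(model_probs, True), n_clusters),
}

for (likelihood, n_choice), value in r2.items():
    print(f"{likelihood:9s} likelihood, n = {n_choice:11s}: R^2 = {value:.2f}")
```

In this example the three values come out at about 0.20, 0.44 and 0.82:
the grouped likelihood inflates R^2 relative to the ungrouped one, and using
the number of clusters as n inflates it further.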

Regards,   Rob

Andrew Robinson wrote:
> Hi Rob,
> 
> On Fri, Dec 12, 2008 at 09:21:15AM -0500, Robert Kushler wrote:
>> I would argue that the larger value (individuals) is always more appropriate
>> than the smaller value (clusters).  
> 
> That's interesting - I have the opposite response, or at least that
> the truth lies somewhere in the middle.  Can you expand on why you
> would argue that the larger value is always more appropriate than the
> smaller value?
> 
>> However, the more important issue is that the "ungrouped" version of
>> the likelihood should be used for these calculations.  Using the
>> "grouped data" likelihood omits the within cluster variation and
>> inflates the estimate of predictive power.
> 
> I don't follow what you mean by "grouped" and "ungrouped" versions of
> the likelihood.  Can you clarify?
> 
> Cheers,
> 
> Andrew
>