[R-sig-ME] aggregation of count data

Don Cohen
Fri Oct 16 18:32:11 CEST 2020


Let me know if there's a better place to post this question.
I don't think it really has anything to do with random effects,
though I am using random effects.

I'm using models that look like this:
count ~ covariate + offset(log(exposure)) + (1 | group)
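
For concreteness, here is a minimal sketch (mine, not from the original post) of how such a model might be fit, assuming glmmTMB (since nbinom2 comes up below) and a data frame dat with columns count, covariate, exposure, and group:

  library(glmmTMB)

  fit_pois <- glmmTMB(count ~ covariate + offset(log(exposure)) + (1 | group),
                      family = poisson, data = dat)
  fit_nb   <- glmmTMB(count ~ covariate + offset(log(exposure)) + (1 | group),
                      family = nbinom2, data = dat)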

The question is about how much to aggregate the data, i.e., whether two
data rows with the same group and covariates, like these

  group  covariate  exposure  count
  A      0          4         3
  A      0          5         2

could be combined into one row:

  group  covariate  exposure  count
  A      0          9         5
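
As a hypothetical R sketch (mine, not from the post), assuming a data frame dat with the columns shown above, that aggregation step is just summing exposure and count within group x covariate:

  agg <- aggregate(cbind(exposure, count) ~ group + covariate,
                   data = dat, FUN = sum)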

I'm convinced that if I use family=poisson then it doesn't matter
whether rows are aggregated, but if I use family=nbinom2 (or probably
any family other than poisson) it does.  That is, the fitted model for
the aggregated data differs from the fitted model for the unaggregated data.
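
A rough way to check that claim (my own sketch, not from the post; the simulated data, column names, and use of glmmTMB are all assumptions) is to simulate overdispersed counts, fit the model to the raw rows and to rows aggregated within group x covariate, and compare the estimates:

  library(glmmTMB)
  set.seed(1)
  n_group <- 30
  raw <- do.call(rbind, lapply(1:n_group, function(g) {
    data.frame(group     = g,
               covariate = rbinom(20, 1, 0.5),
               exposure  = runif(20, 1, 10))
  }))
  re <- rnorm(n_group, 0, 0.3)                        # group random effects
  mu <- exp(-1 + 0.5 * raw$covariate + re[raw$group]) * raw$exposure
  raw$count <- rnbinom(nrow(raw), mu = mu, size = 2)  # overdispersed counts

  agg <- aggregate(cbind(exposure, count) ~ group + covariate,
                   data = raw, FUN = sum)

  f <- count ~ covariate + offset(log(exposure)) + (1 | group)
  fixef(glmmTMB(f, family = poisson, data = raw))  # matches the aggregated fit
  fixef(glmmTMB(f, family = poisson, data = agg))
  fixef(glmmTMB(f, family = nbinom2, data = raw))  # differs from the aggregated fit
  fixef(glmmTMB(f, family = nbinom2, data = agg))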

What I'm looking for is some way to decide how much aggregation to do,
i.e., which model is "best" in terms of telling me something useful
about the data, or perhaps even the world from which it was collected.
Note that "best" here does not mean anything like AIC - that will
always be lower for more aggregated data.

Here's my guess about why the results are different and which results
are likely to be "better".  I'll be interested in any insight that can
be offered by experts who read this.  Also I'd like to know if there
are any references with discussion of this problem.

First, I think the reason the results do NOT differ for poisson is
that there's only one parameter to be estimated (the average count per
unit of exposure), and that parameter is the same regardless of how the
rows are aggregated.  Other distributions, by contrast, also estimate a
second parameter (overdispersion), and that one is different for the
aggregated data and the unaggregated data.
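
To make that concrete (my sketch, not part of the original post): conditional on the group effect, two poisson rows with exposures E_1, E_2, counts y_1, y_2, and common rate \lambda contribute a likelihood

  \prod_{i=1}^{2} \frac{(E_i \lambda)^{y_i} e^{-E_i \lambda}}{y_i!}
    \;\propto\; \lambda^{y_1 + y_2} \, e^{-(E_1 + E_2)\lambda},

which depends on \lambda exactly as a single row with exposure E_1 + E_2 and count y_1 + y_2 does, so aggregation cannot change the fit.  The negative binomial likelihood does not collapse this way: the sum of two nbinom2 counts sharing a dispersion parameter is not nbinom2 with that same dispersion, so aggregation genuinely changes the model.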

I suspect that there might be many different aggregation choices that
tell you different things about the data, i.e., there's no one best
answer, and you should try different choices to find different patterns
in the data.  Here's my argument.

Consider an extreme case where there are many samples for only one
group and covariate, and we combine them all into one row.  This
provides no information on the overdispersion parameter.  It might as
well be poisson.

Where there are many different groups, or perhaps where the covariates
take enough different values, the variation across aggregated rows might
allow a good estimate of the overdispersion parameter even if most of
those rows have high count and exposure, but I'm not convinced of that.
(Any ideas on this?)

There is an opposite extreme that seems to have the same problem.  In
the data I started with, an individual (the grouping turns out to be
by individual) was observed for some amount of time, and whenever an
event of interest occurred, that was noted.  For instance the data
could have been

 started observing at 10:01:00, event at 10:03:14, event at 10:03:29,
 stopped observing 10:11:15

Let's assume that all events are recorded with timestamps at one
second granularity and that it's impossible to have two events at the
same timestamp (for the same individual).  Then the opposite extreme
would be to model every second as a data point for that individual
during which the exposure was 1 second and the count was either zero
or one.
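
Here is a hypothetical sketch of that "one row per second" extreme, using the times from the example above (the date, column names, and group label are mine, just for illustration):

  obs_start <- as.POSIXct("2020-10-16 10:01:00")
  obs_end   <- as.POSIXct("2020-10-16 10:11:15")
  events    <- as.POSIXct(c("2020-10-16 10:03:14", "2020-10-16 10:03:29"))

  secs <- seq(obs_start, obs_end - 1, by = "1 sec")  # one row per observed second
  per_sec <- data.frame(group     = "A",
                        covariate = 0,
                        exposure  = 1,  # one second of exposure per row
                        count     = as.integer(as.numeric(secs) %in% as.numeric(events)))
  nrow(per_sec)       # 615 one-second rows
  sum(per_sec$count)  # 2 events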

Two rows like the ones aggregated above are treated by the model as
independent observations.  Well, not quite, since they share the same
group and covariate.  But conditional on that group and covariate, and
after accounting for exposure, they are supposed to be independent
samples drawn from whatever distribution is specified, e.g., nbinom2.

That assumption, pushed to extreme disaggregation, leads again to a
poisson distribution: if each one-second row is an independent 0/1 draw,
the events form (roughly) a Bernoulli process, and counts over longer
windows are then approximately poisson.  So if the events are NOT
independent (i.e., the distribution is NOT poisson) then it's important
not to disaggregate too far.

In all cases the information in danger of being lost is the fact that
at SOME time scales the events are not poisson.  And you have to
sample at those time scales to realize that.  So watching an
individual for 10 minutes at a time and reporting separate data points
for each 10 minute sample will tell you something about overdispersion
at the 10 minute time scale but might not tell much about
overdispersion at the time scale of days or seconds.
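
A rough way to see that (my own sketch, assuming the per_sec data frame from the one-row-per-second example above) is to re-aggregate the 0/1 rows into windows of different widths and look at how dispersed the window counts are at each time scale:

  window_counts <- function(counts, width) {
    idx <- ceiling(seq_along(counts) / width)  # which window each second falls in
    tapply(counts, idx, sum)
  }
  for (w in c(10, 60, 600)) {                  # 10 s, 1 min, 10 min windows
    y <- window_counts(per_sec$count, w)
    cat(sprintf("width %4d s: mean %.3f  var %.3f  var/mean %.2f\n",
                w, mean(y), var(y), var(y) / mean(y)))
  }

A variance/mean ratio well above 1 at a given window width is a crude sign of overdispersion at that time scale; with real data you would of course do this per individual before deciding what scale to fit the model at.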

The data I have is mostly collected in 10 minute chunks, which would
be good for finding differences from poisson at the 10 minute scale
(or lower by disaggregating).  The 10 minute samples can be combined
to get data for days or months, but the sampling rate is irregular,
i.e., there might be only one 10 minute sample for one month and 20
for another month.  I suspect that by doing maximum aggregation I'm
actually getting an average signal for many different time scales,
since there's a wide range of exposures.


