[R-sig-ME] next on aggregation, nbinom1, nbinom2

Don Cohen
Sun Oct 25 02:04:26 CET 2020


Again, I'll be happy to hear if there's a better place to post this, but
since it includes some R code and calls to glmmTMB, this mailing list 
does seem reasonably appropriate.

There are several things I find surprising about the results below.  
I'd be interested in any ideas of what could cause these things or 
reasons why I should be less surprised.

The table below shows the output of four calls of the form
 glmmTMB(count~offset(log(exposure))+(1|groupid), data= ...,family= ...)
where family is either nbinom1 or nbinom2 and data is aggregated or
unaggregated.  The aggregated data was derived from the unaggregated by:

 library(dplyr)
 unagg.group = group_by(unaggregated, groupid)
 aggregated = summarise(unagg.group, count=sum(count), exposure=sum(exposure))

(Note the aggregated data have exactly one observation per group.
But that's a topic that I think I at least partly understand.)

There are 5474 unaggregated observations and 78 groups.
The unaggregated observations all have close to the same exposure
(9 - 11 minutes, but exposure is measured in hours), so it does make
sense to look at the distribution of counts in that data.  
The mean of the counts is 0.0524, or .3144/hr, 
the variance is 0.315, or 1.89/hr, and the distribution matches NB
with that mean and variance surprisingly (at least to me) well.
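For reference, here is the method-of-moments match of an NB to that mean
and variance, using dnbinom's (size, mu) parameterization (where
V = mu + mu^2/size):

```r
# Moment-match an NB to the per-observation counts quoted above.
m <- 0.0524            # mean of the unaggregated counts
v <- 0.315             # variance of the unaggregated counts
size <- m^2 / (v - m)  # from V = mu + mu^2/size  =>  size = m^2/(v - m)
size                   # ~0.0105: a very overdispersed NB
# Theoretical pmf to compare against the observed count frequencies;
# dnbinom(0, ...) is ~0.98, i.e. almost all observations should be zeros.
dnbinom(0:3, size = size, mu = m)
```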

The groups have widely varying exposures: 22 have less than 1 hr
(all of those have zero counts), 9 have 1-2 hrs, 15 have 2-5 hrs,
15 have 5-15 hrs, 15 have 15-50 hrs and 3 have over 50 hrs.
The count/exposure of the groups has mean 0.437 and variance 1.59, 
so I expected the models to predict means somewhere near the 
range .31 - .43 and variances around the range 1.59 - 1.89.
The three groups with exposure > 50 hr have count/exposure .26, .30, and .38.
The aggregated data clearly contain much less information than the
unaggregated, so it would not be surprising if the models using the
unaggregated data were "better" in some undefined sense at describing
the world from which the data were drawn.  Whereas it WOULD be 
surprising if those models were actually worse.

The particular data that are missing from the aggregated data have to
do with the distribution of the counts and exposures (and perhaps the
count/exposure values) from which the aggregated data were derived.
I imagine this to be useful in estimating the non-mean NB parameters.
And yet there do seem to be other data from which to estimate them,
such as the relation between variance and exposure.  Possibly this is
not enough data for that purpose.  The variation among groups is used 
for the random effect size, which I expect makes it less useful for
estimating other parameters.

#    unaggregated          aggregated 
#  NB1         NB2        NB1      NB2       
#  1370.0 ~=  1372.4      257.4 << 275.9       AIC
#  0.2635     0.4604      7e-9 !   1.315       group variance
#  5.07       0.0116      8.58     0.661       overdispersion parameter
#  -1.2600    -1.3104     -1.1567  -1.3887     intercept
#  0.28        0.27       0.315    0.25        exp(intercept) => mean
#  1.7         6.5 ??     3         .35 ??     mean,overdisp => variance

1. The computed means are at least in the right ball park, but the
computed variances seem wildly off, esp. for NB2.
Am I computing the variance correctly for NB2?
mean * (1 + mean/param) ?
Why is the NB2 overdispersion parameter so different for aggregated data?
This is not supposed to depend on scale is it?
(I'm even surprised that the NB1 parameter is so different, but that's
a factor of less than 2 while NB2 is a factor of 60.)
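For concreteness, here is how I computed the variance row of the table,
assuming glmmTMB's documented parameterizations (nbinom1: V = mu(1+phi),
nbinom2: V = mu(1+mu/theta)).  Note these are the conditional (within-group)
variances; the random-effect variance is on the log scale and not included.

```r
# Conditional variances under glmmTMB's two NB parameterizations:
# nbinom1: V = mu*(1 + phi),  nbinom2: V = mu*(1 + mu/theta)
var_nb1 <- function(mu, phi)   mu * (1 + phi)
var_nb2 <- function(mu, theta) mu * (1 + mu / theta)
var_nb1(0.28,  5.07)    # ~1.7   (unaggregated NB1)
var_nb2(0.27,  0.0116)  # ~6.6   (unaggregated NB2)
var_nb1(0.315, 8.58)    # ~3.0   (aggregated NB1)
var_nb2(0.25,  0.661)   # ~0.34  (aggregated NB2)
```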

2. Why is AIC for NB1 so much better than for NB2 in aggregated data 
when the two are so close for unaggregated data?  This seems to be 
telling us something about the aggregated data which is not evident in 
the unaggregated data.  (What could that be?)  And this contradicts my 
expectation above.
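One possibly relevant fact, which a quick simulation (with made-up numbers,
not my real data) illustrates: a sum of independent NB1 counts is again NB1
with the same dispersion phi, because NB variables with a common 'prob'
add their 'size' parameters, so aggregation should not hurt an NB1 fit
much, while a sum of NB2 counts is not NB2 with the same theta.

```r
# Sums of NB1 counts keep a constant variance/mean ratio (1 + phi).
set.seed(1)
mu  <- 0.05; phi <- 5    # per-observation mean and NB1 dispersion (made up)
n   <- 100               # observations summed per group
# NB1 via the (size, mu) parameterization: size = mu/phi gives V = mu*(1+phi)
sums <- replicate(1e4, sum(rnbinom(n, size = mu/phi, mu = mu)))
var(sums) / mean(sums)   # ~1+phi = 6, same ratio as a single observation
```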

3. Is the near zero variance in NB1 aggregated data a coincidence or
does it mean something?  Surely it's not the case that all of the 
groups have the same mean.  I did check to see that, as expected,
the AIC was 2 less without the random effect.


