[R-sig-ME] Overdispersed and zero-inflated - or not - and if so, how to model them? #glmmTMB
Hein van Lieverloo
he|n@v@n@||ever|oo @end|ng |rom v|@etern@@n|
Thu Mar 14 15:46:58 CET 2019
Dear all,
Keywords: #glmmTMB #overdisp #zero_count
I am grateful for this mailing list and in advance, for any helpful
response.
This e-mail has two related questions.
Details (summary, background, approach and results) are given below them.
Question 1: my data are zero-inflated and overdispersed, but what does the
overdispersion parameter in glmmTMB (genpois, negbin1, negbin2) tell me?
It is very high in genpois and negbin1 models (see question 2) and I
thought it should be near 1, like in negbin2 (>> 1 is overdispersed, <<1 is
underdispersed)
But when I test these generalized models for overdispersion
(overdisp from sjstats), no overdispersion is indicated.
Question 2: should I use Gaussian on log(counts) with AIC 2068 or use
negbin2 with AIC 8036 and add overdispersion and zero-inflation models to
get a lower AIC (and if so, how?)
When I use glmmTMB on counts with poisson, I get an AIC of 117 856.
Testing the model with overdisp and zero_count (from the sjstats package), I
find p = 0 (overdispersed) and zc-ratio 0.81 (probable zero-inflation).
When I use glmmTMB on log10(counts), with 0's estimated to 0.1 so
resulting in -1, I get an AIC of 2068 (with lmer: 2122). Looks fine, but
may be wrong.
When I use glmmTMB on counts with either genpois (dispersion par
613), negbinom1 (dispersion par 287) or negbinom2 (dispersion par 0.72), I
get AIC's over 8036. Much higher, but may be ok.
My data are zero-inflated and overdispersed and I would think that
glmmTMB with generalized models would result in much better models (lower
AIC) than simply working with the log-transformed data.
The p-values per variable are similar enough, by the way, see the
best two models at the end of this mail.
Of course, simply transforming 0 counts into -1 at the log-level
could be the cause and this approach may oversimplify reality and the AIC of
2068 could be artificial.
If overdispersion and zero-inflation really is necessary, do I need
to get the AIC down from 8036 to 2068 or can I accept higher AICs? I
suppose I can.
But then: how should I approach the development of the zi-model
and/or the overdispersion models?
I know, from theory, but the thing is, there is little of no
research on invertebrates in drinking water distribution systems and their
structure is so different from surface water systems, that we are developing
hypotheses from this data set.
Summary of design and model
- Invertebrates in drinking water distribution systems in The Netherlands:
1993-1995 (yes, very old data!).
- glmmTMB of multilevel model (1 | vNr / lNr) : 34 systems (v), 175
sampling locations (l, ~5/system), 1301 samples (~ 8 quarters from
1993-1995), a multitude of variables measured.
- One of the best model tested: lWapit (count data) ~ pTDOC + tCa + logtMn
+ lnOType + logbS500 + bTemp + blWavlo + blRoeiNaup + blMoskr
Background
The data were collected in the '90's and basic results were published in
2012:
https://www.sciencedirect.com/science/article/abs/pii/S0043135412002217?via%
3Dihub
Dissolved organic matter is the best (causal / proxy / collinear?) predictor
for energy and carbon supply (R2 ~ 0.6 on mean estimated mean biomass at the
system level).
I can send you the paper if you want. Also, I can sent more details, short
of the data set.
Since, when I have time (no funding), I try to find more predictors, at more
than just the highest aggregated level (system). I followed some courses on
multilevel modeling was well.
In 2013 a statistician using GenStat told me my data were zero-inflated and
overdispersed.
So, no glmm with Poisson response possible. The only option was: first a
glmm binomial for absence - presence, then glmm Poisson on the
presence-data.
The past two weeks (finally, I found some time again) I was and am so happy
to find Ben Bolker's glmmTMB, able to work with zero-inflation and
overdispersion (I heard of MCMC options in 2017, no time then).
Learning from Ben Bolker's Salamanders-work, I managed to come a long way,
but I have not been able to develops stable overdispersion or zero-inflation
generalized models that significantly lower AIC in glmmTMB.
Although I teach the basics of statistics and made a lot of LM-models, I am
not a statistician (I'm a biologist happily forced toward statistics), and I
find a lot of details and mathematics hard to grasp:
https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html
Model and comparison approach
• System-level variable names start with p, location-level variables start
with l, t or log t, sample-level variables start with b or logb. Only
lnOType is a three types factor (bl are log-counts of other taxa)
• tCa and tMn = calcium and manganese in tap water (mean over time), lnOType
= village, city or rural environment, lbS500 = sediment > 500 um per sample,
bTemp = temperature sample
• blWavlo, blRoeipNaup, blMoskr = log(count(taxon; -1 <- 0)) per sample for
Cladocera, Copepoda, Ostracoda
• 0-model contains no parameters (response ~ 1), 1-model contains major
predictor (pTDOC), full model contains 21 likely/possible predictors
• Model is kept identical in all regressions, although other versions may
have lower AIC
• Model data for comparison = all data (during model development, systems
were randomly split approx. 60-40)
• I did not include overdispersion or zero-inflated models yet, as I am not
sure whether it is necessary and I cannot get the basic ones (e.g. just with
pTDOC) stable. I can imagine that adding empty ZI-models is not very
effective in countering zero-inflation
Results (I can send more details, if required)
AIC per model (dispersion only for best model: x-model)
multilevel model: + (1|vNr / lNr) for all except lm
response = blWapit = log(count(bWapi)), where -1 <- 0) (counts expressed
per m3)
lm 0-model 1-model x-model full model
Gaussian 4293.4 4014.5 3778.2 3642.9
lmer 0-model 1-model x-model full model
Gaussian 2185.8 2122 2121.9 2185.8
glmmTMB 0-model 1-model x-model full model
Gaussian 2128.7 2116.6 2068.2 2074
response = b4Wapit = count(bWapi) expressed as rounded per 4 m3 (most sample
volumes are very close to that)
glmmTMB Disp ratio (p) Dispersion par zc ratio zi-model
0-model 1-model x-model full model remarks
poisson 99.4 (0) * NA 0.81 ** NA
137165 137157 117856 114773 * p (H0: not
overdispersed) **zero-inflation probable
genpois 0.34 (1) 613 NA NA
8096.8 8088.1 8036.1 8042.7
genpois (+ZI) NA 603 NA zi =~ 1
8094.1 8085.5 8036.6 8043.1
trunc genpois NA 701 (1-model) NA zi =~ 1
9109.7 9097.5 * * *with zi =
~1 or zi =~pTDOC, non-positive-definite Hessian matrix
nbinom1 0.53 (1) 287 NA NA
8306.4 * 8297.7 * 8244.6 * 8251.8 * * warnings:
In f(par, order = order, ...) : value out of range in 'lgamma'
nbinom1 (+ZI) NA 287 NA zi =~ 1
8306.4 8299.7 8246.6 * 8253.7 * warnings:
In f(par, order = order, ...) : value out of range in 'lgamma'
nbinom2 0.78 0.72 NA NA
8224.1 8216.0 8165.3 8171.8
nbinom2 (+ZI) NA 0.787 NA zi =~ 1
8226.1 8218.0 8165.3 8172.6
Comparing the best generalized glmmTMB model (nbinom2) on counts with the
best Gaussian model on log10(counts, 0 -> -1)
Family: nbinom2 ( log )
Family: gaussian ( identity )
Formula: b4Wapit ~ pTDOC + tCa + logtMn + lnOType + logbS500 +
bTemp + Formula: blWapit ~ pTDOC + tCa + logtMn + lnOType +
logbS500 + bTemp +
blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)
blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)
Data: AllData
Data: AllData
AIC BIC logLik deviance df.resid
AIC BIC logLik deviance df.resid
8165.3 8237.7 -4068.6 8137.3 1287
2068.2 2140.6 -1020.1 2040.2 1287
Random effects:
Random effects:
Conditional model:
Conditional model:
Groups Name Variance Std.Dev.
Groups Name Variance Std.Dev.
lNr:vNr (Intercept) 4.325 2.080
lNr:vNr (Intercept) 0.4850 0.6964
vNr (Intercept) 5.913 2.432
vNr (Intercept) 0.4001 0.6326
Residual 0.1794 0.4236
Number of obs: 1301, groups: lNr:vNr, 175; vNr, 34
Number of obs: 1301, groups: lNr:vNr, 175; vNr, 34
Overdispersion parameter for nbinom2 family (): 0.72
Dispersion estimate for gaussian family (sigma^2): 0.179
Conditional model:
Conditional model:
Estimate Std. Error z value Pr(>|z|)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.25438 1.80820 -2.906 0.003662 **
(Intercept) -1.566112 0.487738 -3.211 0.00132 **
pTDOC 0.95740 0.26583 3.602 0.000316 ***
pTDOC 0.298345 0.070836 4.212 2.53e-05 ***
tCa 0.06371 0.01623 3.926 8.64e-05 ***
tCa 0.013963 0.004579 3.050 0.00229 **
logtMn 0.83018 0.49447 1.679 0.093164 .
logtMn 0.243523 0.135480 1.797 0.07226 .
lnOTypeland 0.97151 0.51131 1.900 0.057425 .
lnOTypeland 0.390505 0.152519 2.560 0.01046 *
lnOTypestad -0.72832 0.82751 -0.880 0.378788
lnOTypestad 0.112042 0.231978 0.483 0.62911
logbS500 0.44416 0.10870 4.086 4.39e-05 ***
logbS500 0.127756 0.029836 4.282 1.85e-05 ***
bTemp 0.03655 0.01301 2.810 0.004948 **
bTemp 0.007290 0.003264 2.234 0.02551 *
blWavlo 0.14475 0.04158 3.481 0.000500 ***
blWavlo 0.036470 0.011718 3.112 0.00186 **
blRoeiNaup 0.11404 0.05684 2.006 0.044818 *
blRoeiNaup 0.042573 0.015790 2.696 0.00701 **
blMoskr -0.25836 0.11832 -2.184 0.028993 *
blMoskr -0.066737 0.030350 -2.199 0.02788 *
---
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Many thanks in advance for your help!
Kind regards,
Hein van Lieverloo
Met vriendelijke groet,
Hein van Lieverloo
More information about the R-sig-mixed-models
mailing list