[R-sig-ME] Overdispersed and zero-inflated - or not - and if so, how to model them? #glmmTMB

Thu Mar 14 15:46:58 CET 2019

Dear all,

Keywords: #glmmTMB  #overdisp  #zero_count

I am grateful for this mailing list and in advance, for any helpful
response.
This e-mail has two related questions.
Details (summary, background, approach and results) are given below them.

Question 1: my data are zero-inflated and overdispersed, but what does the
overdispersion parameter in glmmTMB (genpois, negbin1, negbin2) tell me? 
	It is very high in genpois and negbin1 models (see question 2) and I
thought it should be near 1, like in negbin2 (>> 1 is overdispersed, <<1 is
underdispersed)
	But when I test these generalized models for overdispersion
(overdisp from sjstats), no overdispersion is indicated.

Question 2: should I use Gaussian on log(counts) with AIC 2068  or use
negbin2 with AIC 8036 and add overdispersion and zero-inflation models to
get a lower AIC (and if so, how?)
	When I use glmmTMB on counts with poisson, I get an AIC of 117 856.
Testing the model with overdisp and zero_count (from the sjstats package), I
find p = 0 (overdispersed) and zc-ratio 0.81 (probable zero-inflation).
	When I use glmmTMB on log10(counts), with 0's estimated to 0.1 so
resulting in -1, I get an AIC of 2068  (with lmer: 2122). Looks fine, but
may be wrong.
	When I use glmmTMB on counts with either genpois (dispersion par
613), negbinom1 (dispersion par 287) or negbinom2 (dispersion par 0.72), I
get AIC's over 8036. Much higher, but may be ok.

	My data are zero-inflated and overdispersed and I would think that
glmmTMB with generalized models would result in much better models (lower
AIC) than simply working with the log-transformed data.
	The p-values per variable are similar enough, by the way, see the
best two models at the end of this mail.
	Of course, simply transforming 0 counts into -1 at the log-level
could be the cause and this approach may oversimplify reality and the AIC of
2068 could be artificial.
	If overdispersion and zero-inflation really is necessary, do I need
to get the AIC  down from 8036 to 2068 or can I accept higher AICs? I
suppose I can.
	But then: how should I approach the development of the zi-model
and/or the overdispersion models? 
	I know, from theory, but the thing is, there is little of no
research on invertebrates in drinking water distribution systems and their
structure is so different from surface water systems, that we are developing
hypotheses from this data set.

Summary of design and model
- Invertebrates in drinking water distribution systems in The Netherlands:
1993-1995 (yes, very old data!).
- glmmTMB of multilevel model  (1 | vNr / lNr)  : 34 systems (v), 175
sampling locations (l, ~5/system), 1301 samples (~ 8 quarters from
1993-1995), a multitude of variables measured.
- One of the best model tested: lWapit (count data)  ~ pTDOC + tCa + logtMn
+ lnOType + logbS500 + bTemp + blWavlo + blRoeiNaup + blMoskr

Background
The data were collected in the '90's and basic results were published in
2012:
https://www.sciencedirect.com/science/article/abs/pii/S0043135412002217?via%
3Dihub
Dissolved organic matter is the best (causal / proxy / collinear?) predictor
for energy and carbon supply (R2 ~ 0.6 on mean estimated mean biomass at the
system level).  
I can send you the paper if you want. Also, I can sent more details, short
of the data set.
Since, when I have time (no funding), I try to find more predictors, at more
than just the highest aggregated level (system). I followed some courses on
multilevel modeling was well.
In 2013 a statistician using GenStat told me my data were zero-inflated and
overdispersed.
So, no glmm with Poisson response possible. The only option was: first a
glmm binomial for absence - presence, then glmm Poisson on the
presence-data.

The past two weeks (finally, I found some time again) I was and am so happy
to find Ben Bolker's  glmmTMB, able to work with zero-inflation and
overdispersion (I heard of MCMC options in 2017, no time then).
Learning from Ben Bolker's Salamanders-work, I managed to come a long way,
but I have not been able to develops stable overdispersion or zero-inflation
generalized models that significantly lower AIC in glmmTMB.
Although I teach the basics of statistics and made a lot of LM-models, I am
not a statistician (I'm a biologist happily forced toward statistics), and I
find a lot of details and mathematics hard to grasp:
https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html

Model and comparison approach
• System-level variable names start with p, location-level variables start
with l, t or log t, sample-level variables start with b or logb. Only
lnOType is a three types factor (bl are log-counts of other taxa)
• tCa and tMn = calcium and manganese in tap water (mean over time), lnOType
= village, city or rural environment, lbS500 = sediment > 500 um per sample,
bTemp = temperature sample
• blWavlo, blRoeipNaup, blMoskr = log(count(taxon; -1 <- 0)) per sample for
Cladocera, Copepoda, Ostracoda 
• 0-model contains no parameters (response ~ 1), 1-model contains major
predictor (pTDOC), full model contains 21 likely/possible predictors
• Model is kept identical in all regressions, although other versions may
have lower AIC
• Model data for comparison = all data  (during model development, systems
were randomly split approx. 60-40)
• I did not include overdispersion or zero-inflated models yet, as I am not
sure whether it is necessary and I cannot get the basic ones (e.g. just with
pTDOC) stable. I can imagine that adding empty ZI-models is not very
effective in countering zero-inflation

Results (I can send more details, if required)

AIC per model (dispersion only for best model: x-model)

multilevel model:  + (1|vNr / lNr) for all except lm

response = blWapit = log(count(bWapi)), where -1 <- 0)  (counts expressed
per m3)

lm		0-model	1-model	x-model	full model
Gaussian	4293.4		4014.5		3778.2		3642.9

lmer		0-model	1-model	x-model	full model
Gaussian	2185.8		2122		2121.9		2185.8

glmmTMB	0-model	1-model	x-model	full model
Gaussian	 2128.7		2116.6		2068.2		2074

response = b4Wapit = count(bWapi) expressed as rounded per 4 m3 (most sample
volumes are very close to that)

glmmTMB	Disp ratio (p)	Dispersion par	zc ratio		zi-model
0-model	1-model	x-model	full model	remarks
poisson		99.4 (0) *	NA		0.81 **		NA
137165		137157		117856		114773		* p (H0: not
overdispersed) **zero-inflation probable
genpois		0.34 (1)		613		NA		NA
8096.8		8088.1		8036.1		8042.7	
genpois (+ZI)	NA		603		NA		zi =~ 1
8094.1		8085.5		8036.6		8043.1	
trunc genpois	NA		701 (1-model)	NA		zi =~ 1
9109.7		9097.5		*		*		*with zi =
~1 or zi =~pTDOC, non-positive-definite Hessian matrix
nbinom1	0.53 (1)		287		NA		NA
8306.4 *	8297.7 *	8244.6 *	8251.8 *	* warnings:
In f(par, order = order, ...) : value out of range in 'lgamma'
nbinom1 (+ZI)	NA		287		NA		zi =~ 1
8306.4		8299.7		8246.6 *	8253.7		* warnings:
In f(par, order = order, ...) : value out of range in 'lgamma'
nbinom2	0.78		0.72		NA		NA
8224.1		8216.0		8165.3		8171.8	
nbinom2 (+ZI)	NA		0.787		NA		zi =~ 1
8226.1		8218.0		8165.3		8172.6	

Comparing the best generalized glmmTMB model (nbinom2) on counts with the
best Gaussian model on log10(counts, 0 -> -1)

Family: nbinom2  ( log )
Family: gaussian  ( identity )					
Formula:          b4Wapit ~ pTDOC + tCa + logtMn + lnOType + logbS500 +
bTemp +  	Formula:          blWapit ~ pTDOC + tCa + logtMn + lnOType +
logbS500 + bTemp +  					
    blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)
blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)

Data: AllData
Data: AllData					

     AIC      BIC   logLik deviance df.resid
AIC      BIC   logLik deviance df.resid

  8165.3   8237.7  -4068.6   8137.3     1287
2068.2   2140.6  -1020.1   2040.2     1287

Random effects:
Random effects:					

Conditional model:
Conditional model:					
 Groups  Name        Variance Std.Dev.
Groups   Name        Variance Std.Dev.					
 lNr:vNr (Intercept) 4.325    2.080
lNr:vNr  (Intercept) 0.4850   0.6964  					
 vNr     (Intercept) 5.913    2.432
vNr      (Intercept) 0.4001   0.6326  					

Residual             0.1794   0.4236  					
Number of obs: 1301, groups:  lNr:vNr, 175; vNr, 34
Number of obs: 1301, groups:  lNr:vNr, 175; vNr, 34

Overdispersion parameter for nbinom2 family (): 0.72
Dispersion estimate for gaussian family (sigma^2): 0.179

Conditional model:
Conditional model:					
	Estimate Std. Error z value Pr(>|z|)
Estimate Std. Error z value Pr(>|z|)

(Intercept) -5.25438    1.80820  -2.906 0.003662 **
(Intercept) -1.566112   0.487738  -3.211  0.00132 **

pTDOC        0.95740    0.26583   3.602 0.000316 ***
pTDOC        0.298345   0.070836   4.212 2.53e-05 ***

tCa          0.06371    0.01623   3.926 8.64e-05 ***
tCa          0.013963   0.004579   3.050  0.00229 **

logtMn       0.83018    0.49447   1.679 0.093164 .
logtMn       0.243523   0.135480   1.797  0.07226 .

lnOTypeland  0.97151    0.51131   1.900 0.057425 .
lnOTypeland  0.390505   0.152519   2.560  0.01046 *

lnOTypestad -0.72832    0.82751  -0.880 0.378788
lnOTypestad  0.112042   0.231978   0.483  0.62911

logbS500     0.44416    0.10870   4.086 4.39e-05 ***
logbS500     0.127756   0.029836   4.282 1.85e-05 ***

bTemp        0.03655    0.01301   2.810 0.004948 **
bTemp        0.007290   0.003264   2.234  0.02551 *

blWavlo      0.14475    0.04158   3.481 0.000500 ***
blWavlo      0.036470   0.011718   3.112  0.00186 **

blRoeiNaup   0.11404    0.05684   2.006 0.044818 *
blRoeiNaup   0.042573   0.015790   2.696  0.00701 **

blMoskr     -0.25836    0.11832  -2.184 0.028993 *
blMoskr     -0.066737   0.030350  -2.199  0.02788 *

---
---		
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Many thanks in advance for your help!

Kind regards,

Hein van Lieverloo

Met vriendelijke groet,

Hein van Lieverloo