[R-sig-ME] Overdispersed and zero-inflated - or not - and if so, how to model them? #glmmTMB

Thu Mar 14 16:34:20 CET 2019

Dear Hein,

See replies below...

> On 14 Mar 2019, at 15:46, Hein van Lieverloo <hein.van.lieverloo using viaeterna.nl> wrote:
> 
> Dear all,
> 
> Keywords: #glmmTMB  #overdisp  #zero_count
> 
> I am grateful for this mailing list and in advance, for any helpful
> response.
> This e-mail has two related questions.
> Details (summary, background, approach and results) are given below them.
> 
> Question 1: my data are zero-inflated and overdispersed, but what does the
> overdispersion parameter in glmmTMB (genpois, nbinom1, nbinom2) tell me?
> 	It is very high in the genpois and nbinom1 models (see question 2), and I
> thought it should be near 1, as in nbinom2 (>> 1 is overdispersed, << 1 is
> underdispersed).
> 	But when I test these generalized models for overdispersion
> (overdisp from sjstats), no overdispersion is indicated.

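The dispersion parameter in a glmmTMB model is estimated precisely to handle the dispersion, so it’s fine if it’s different from 1. It is not the dispersion ratio that sjstats tests, and its meaning depends on the family: for nbinom1 the variance is mu*(1 + phi), so a large value means strong overdispersion, whereas for nbinom2 the variance is mu*(1 + mu/phi), so a small value means strong overdispersion. Your tests with sjstats therefore seem to be correct: once the family accounts for the extra variance, no overdispersion remains. For descriptions of how the dispersion parameters relate to the variance, see ?sigma.glmmTMB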
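
As a rough illustration of why the same data can give a dispersion parameter of several hundred under nbinom1 and below 1 under nbinom2, here is a small base-R simulation. It is only a sketch: mu = 10 is an arbitrary assumption, the values 0.72 and 287 simply echo the estimates quoted in the question, and rnbinom() stands in for a fitted model.

```r
# Sketch with simulated data: how the nbinom1 and nbinom2 dispersion
# parameters map onto the variance. mu is an arbitrary, made-up mean.
set.seed(42)
n  <- 1e6
mu <- 10

# nbinom2: variance = mu * (1 + mu / theta); a larger theta means LESS variance
theta <- 0.72
y2 <- rnbinom(n, mu = mu, size = theta)

# nbinom1: variance = mu * (1 + alpha); a larger alpha means MORE variance.
# rnbinom() has variance mu + mu^2 / size, so size = mu / alpha gives nbinom1.
alpha <- 287
y1 <- rnbinom(n, mu = mu, size = mu / alpha)

var_table <- data.frame(
  family      = c("nbinom2", "nbinom1"),
  dispersion  = c(theta, alpha),
  theoretical = c(mu * (1 + mu / theta), mu * (1 + alpha)),
  empirical   = c(var(y2), var(y1))
)
print(var_table)
```

Both simulated samples are heavily overdispersed relative to a Poisson with the same mean, even though one dispersion parameter is far above 1 and the other is below 1.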

> 
> Question 2: should I use Gaussian on log(counts) with AIC 2068, or use
> nbinom2 with AIC 8036 and add overdispersion and zero-inflation models to
> get a lower AIC (and if so, how)?
> 	When I use glmmTMB on counts with poisson, I get an AIC of 117 856.
> Testing the model with overdisp and zero_count (from the sjstats package), I
> find p = 0 (overdispersed) and zc-ratio 0.81 (probable zero-inflation).
> 	When I use glmmTMB on log10(counts), with 0s replaced by 0.1 (so
> log10 gives -1), I get an AIC of 2068 (with lmer: 2122). Looks fine, but
> may be wrong.
> 	When I use glmmTMB on counts with either genpois (dispersion par
> 613), nbinom1 (dispersion par 287) or nbinom2 (dispersion par 0.72), I
> get AICs of 8036 or higher. Much higher, but may be OK.

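You can’t compare the AIC of a model fitted to log-transformed data with the AIC of a model fitted to the raw counts, because the two likelihoods are computed on different response scales. For example,
> library(glmmTMB)
> set.seed(1)
> x <- rpois(100, lambda = 5)
> AIC(glmmTMB(log(x) ~ 1))                # Gaussian model on the log scale
[1] 128.0742
> AIC(glmmTMB(x ~ 1, family = poisson))   # Poisson model on the count scale
[1] 422.911

or see the discussion here: https://stats.stackexchange.com/questions/61332/comparing-aic-of-a-model-and-its-log-transformed-version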
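
To make the scale problem concrete, here is a minimal sketch. A Gaussian model for log(x) only implies a likelihood for x itself once the Jacobian term is added, which shifts the AIC by 2 * sum(log(x)). The code uses base R's lm() and glm() rather than glmmTMB to stay dependency-free, and lambda = 20 is an arbitrary assumption chosen so that zeros are practically impossible and log(x) stays finite. Even after the adjustment you are setting a continuous density against a discrete probability mass function, so please treat this as an illustration and not as a recommended model-selection procedure.

```r
set.seed(1)
x <- rpois(100, lambda = 20)
stopifnot(all(x > 0))                     # log(x) must be finite

m_log  <- lm(log(x) ~ 1)                  # Gaussian model for log(x)
m_pois <- glm(x ~ 1, family = poisson)    # Poisson model for x

aic_log_raw <- AIC(m_log)                       # likelihood of log(x)
aic_log_adj <- aic_log_raw + 2 * sum(log(x))    # likelihood of x (lognormal)
aic_pois    <- AIC(m_pois)

print(c(log_scale = aic_log_raw, log_adjusted = aic_log_adj, poisson = aic_pois))
```

The unadjusted log-scale AIC is far below the Poisson AIC purely because of the change of scale; after the adjustment the two values are of the same order.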

> 
> 	My data are zero-inflated and overdispersed and I would think that
> glmmTMB with generalized models would result in much better models (lower
> AIC) than simply working with the log-transformed data.
> 	The p-values per variable are similar enough, by the way, see the
> best two models at the end of this mail.
> 	Of course, simply transforming 0 counts into -1 at the log-level
> could be the cause and this approach may oversimplify reality and the AIC of
> 2068 could be artificial.
> 	If modelling overdispersion and zero-inflation really is necessary, do I
> need to get the AIC down from 8036 to 2068, or can I accept higher AICs? I
> suppose I can.
> 	But then: how should I approach the development of the zi-model
> and/or the overdispersion models?
> 	I know: from theory. But there is little or no
> research on invertebrates in drinking water distribution systems, and their
> structure is so different from surface water systems that we are developing
> hypotheses from this data set.
> 
> 
> Summary of design and model
> - Invertebrates in drinking water distribution systems in The Netherlands:
> 1993-1995 (yes, very old data!).
> - glmmTMB of multilevel model  (1 | vNr / lNr)  : 34 systems (v), 175
> sampling locations (l, ~5/system), 1301 samples (~ 8 quarters from
> 1993-1995), a multitude of variables measured.
> - One of the best models tested: lWapit (count data)  ~ pTDOC + tCa + logtMn
> + lnOType + logbS500 + bTemp + blWavlo + blRoeiNaup + blMoskr
> 
> 
> Background
> The data were collected in the '90s and basic results were published in
> 2012:
> https://www.sciencedirect.com/science/article/abs/pii/S0043135412002217?via%3Dihub
> Dissolved organic matter is the best (causal / proxy / collinear?) predictor
> for energy and carbon supply (R2 ~ 0.6 on mean estimated mean biomass at the
> system level).
> I can send you the paper if you want. Also, I can send more details, short
> of the data set.
> Since then, whenever I have time (no funding), I try to find more predictors,
> at more than just the highest aggregated level (system). I followed some
> courses on multilevel modeling as well.
> In 2013 a statistician using GenStat told me my data were zero-inflated and
> overdispersed.
> So, no glmm with Poisson response possible. The only option was: first a
> glmm binomial for absence - presence, then glmm Poisson on the
> presence-data.
> 
> The past two weeks (finally, I found some time again) I was and am so happy
> to find Ben Bolker's  glmmTMB, able to work with zero-inflation and
> overdispersion (I heard of MCMC options in 2017, no time then).
> Learning from Ben Bolker's Salamanders-work, I managed to come a long way,
> but I have not been able to develop stable overdispersion or zero-inflation
> generalized models that significantly lower AIC in glmmTMB.
> Although I teach the basics of statistics and made a lot of LM-models, I am
> not a statistician (I'm a biologist happily forced toward statistics), and I
> find a lot of details and mathematics hard to grasp:
> https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html
> 

I’m glad that glmmTMB is solving some of your long-standing problems and I agree that Ben Bolker has contributed immensely to glmmTMB and GLMMs in general, but calling it "Ben Bolker’s glmmTMB" is disregarding the other developers of the package and documentation.

> 
> 
> Model and comparison approach
> • System-level variable names start with p, location-level variables start
> with l, t or logt, sample-level variables start with b or logb. Only
> lnOType is a three-level factor (bl are log-counts of other taxa)
> • tCa and tMn = calcium and manganese in tap water (mean over time), lnOType
> = village, city or rural environment, logbS500 = sediment > 500 um per sample,
> bTemp = sample temperature
> • blWavlo, blRoeiNaup, blMoskr = log(count(taxon)), with 0 counts set to -1,
> per sample for Cladocera, Copepoda, Ostracoda
> • 0-model contains no parameters (response ~ 1), 1-model contains major
> predictor (pTDOC), full model contains 21 likely/possible predictors
> • Model is kept identical in all regressions, although other versions may
> have lower AIC
> • Model data for comparison = all data  (during model development, systems
> were randomly split approx. 60-40)

Make sure you’re using the same data for all models in AIC comparisons. 
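
One common way this goes wrong (my assumption, not something reported in your message) is missing values in a predictor: models with different predictors then silently drop different rows. A minimal base-R sketch with made-up data, where y and z are hypothetical variables:

```r
set.seed(1)
d <- data.frame(y = rpois(50, lambda = 3), z = rnorm(50))
d$z[1:5] <- NA                      # five samples lack the predictor

# Fitted on the full data frame, the two models use different rows
m0 <- glm(y ~ 1, family = poisson, data = d)
m1 <- glm(y ~ z, family = poisson, data = d)
print(c(nobs(m0), nobs(m1)))        # 50 45

# Restrict to complete cases for every variable used in any candidate model
d_cc  <- d[complete.cases(d[, c("y", "z")]), ]
m0_cc <- glm(y ~ 1, family = poisson, data = d_cc)
m1_cc <- glm(y ~ z, family = poisson, data = d_cc)
stopifnot(nobs(m0_cc) == nobs(m1_cc))
print(AIC(m0_cc, m1_cc))
```

Checking nobs() on every candidate model before building an AIC table is a cheap safeguard.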

> • I did not include overdispersion or zero-inflated models yet, as I am not
> sure whether it is necessary and I cannot get the basic ones (e.g. just with
> pTDOC) stable. I can imagine that adding empty ZI-models is not very
> effective in countering zero-inflation
> 

For people in your situation, I typically recommend fitting a negative binomial model (see Warton, Environmetrics 2005), then testing for zero-inflation (I typically use DHARMa, but it sounds like sjstats does this also). Then if you have zero-inflation, you could fit a zero-inflated negative binomial. Then if the nbinom2 dispersion parameter in the conditional model gets very large, it means you might as well use a zero-inflated Poisson (see nbinom2 in ?sigma.glmmTMB for the reason). However, the best distribution could change depending on the predictors in the model because a model that explains less of the variance might have more dispersion. As you saw in the salamander examples (Brooks et al. 2017, R Journal, Appendix A), you can try different zero-inflation models.
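
The steps above could be sketched roughly as follows. This uses the Salamanders data that ship with glmmTMB, not your data, and the formula count ~ mined + (1 | site) is only an illustrative assumption; you would substitute your own response, predictors and (1 | vNr/lNr).

```r
library(glmmTMB)
library(DHARMa)

data(Salamanders, package = "glmmTMB")

# Step 1: negative binomial (nbinom2) model without zero-inflation
m_nb <- glmmTMB(count ~ mined + (1 | site),
                family = nbinom2, data = Salamanders)

# Step 2: simulation-based test for excess zeros
res     <- simulateResiduals(m_nb)
zi_test <- testZeroInflation(res)
print(zi_test)

# Step 3: zero-inflated negative binomial, with constant and
# predictor-dependent zero-inflation
m_zinb1 <- glmmTMB(count ~ mined + (1 | site), ziformula = ~ 1,
                   family = nbinom2, data = Salamanders)
m_zinb2 <- glmmTMB(count ~ mined + (1 | site), ziformula = ~ mined,
                   family = nbinom2, data = Salamanders)

# Step 4: a very large nbinom2 dispersion parameter suggests the Poisson
# limit, in which case a zero-inflated Poisson is worth trying
print(sigma(m_zinb2))
m_zip <- glmmTMB(count ~ mined + (1 | site), ziformula = ~ mined,
                 family = poisson, data = Salamanders)

# Same response and same rows in every model, so the AICs are comparable
aic_tab <- AIC(m_nb, m_zinb1, m_zinb2, m_zip)
print(aic_tab)
```

Which model comes out best here says nothing about your invertebrate data; the point is only the order of the steps.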

cheers,
Mollie

> 
> Results (I can send more details, if required)
> 
> AIC per model (dispersion only for best model: x-model)
> 
> multilevel model:  + (1|vNr / lNr) for all except lm
> 
> 	
> 
> response = blWapit = log(count(bWapi)), with 0 counts set to -1 on the
> log scale (counts expressed per m3)
>
> Gaussian   0-model   1-model   x-model   full model
> lm         4293.4    4014.5    3778.2    3642.9
> lmer       2185.8    2122      2121.9    2185.8
> glmmTMB    2128.7    2116.6    2068.2    2074
> 
> 	
> 
> response = b4Wapit = count(bWapi) expressed as rounded per 4 m3 (most sample
> volumes are very close to that)
>
> glmmTMB         Disp ratio (p)   Dispersion par   zc ratio   zi-model
> poisson         99.4 (0) *       NA               0.81 **    NA
> genpois         0.34 (1)         613              NA         NA
> genpois (+ZI)   NA               603              NA         zi = ~1
> trunc genpois   NA               701 (1-model)    NA         zi = ~1
> nbinom1         0.53 (1)         287              NA         NA
> nbinom1 (+ZI)   NA               287              NA         zi = ~1
> nbinom2         0.78             0.72             NA         NA
> nbinom2 (+ZI)   NA               0.787            NA         zi = ~1
>
> AIC             0-model     1-model     x-model     full model
> poisson         137165      137157      117856      114773
> genpois         8096.8      8088.1      8036.1      8042.7
> genpois (+ZI)   8094.1      8085.5      8036.6      8043.1
> trunc genpois   9109.7      9097.5      *           *
> nbinom1         8306.4 *    8297.7 *    8244.6 *    8251.8 *
> nbinom1 (+ZI)   8306.4      8299.7      8246.6 *    8253.7
> nbinom2         8224.1      8216.0      8165.3      8171.8
> nbinom2 (+ZI)   8226.1      8218.0      8165.3      8172.6
>
> Remarks:
> poisson: * p (H0: not overdispersed); ** zero-inflation probable
> trunc genpois: * with zi = ~1 or zi = ~pTDOC, non-positive-definite
> Hessian matrix
> nbinom1 and nbinom1 (+ZI): * warnings: In f(par, order = order, ...) :
> value out of range in 'lgamma'
> 
> 
> Comparing the best generalized glmmTMB model (nbinom2) on counts with the
> best Gaussian model on log10(counts, 0 -> -1)
> 
> Family: nbinom2  ( log )
> Formula:          b4Wapit ~ pTDOC + tCa + logtMn + lnOType + logbS500 + bTemp +
>     blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)
> Data: AllData
>
>      AIC      BIC   logLik deviance df.resid
>   8165.3   8237.7  -4068.6   8137.3     1287
>
> Random effects:
>
> Conditional model:
>  Groups  Name        Variance Std.Dev.
>  lNr:vNr (Intercept) 4.325    2.080
>  vNr     (Intercept) 5.913    2.432
> Number of obs: 1301, groups:  lNr:vNr, 175; vNr, 34
>
> Overdispersion parameter for nbinom2 family (): 0.72
>
> Conditional model:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept) -5.25438    1.80820  -2.906 0.003662 **
> pTDOC        0.95740    0.26583   3.602 0.000316 ***
> tCa          0.06371    0.01623   3.926 8.64e-05 ***
> logtMn       0.83018    0.49447   1.679 0.093164 .
> lnOTypeland  0.97151    0.51131   1.900 0.057425 .
> lnOTypestad -0.72832    0.82751  -0.880 0.378788
> logbS500     0.44416    0.10870   4.086 4.39e-05 ***
> bTemp        0.03655    0.01301   2.810 0.004948 **
> blWavlo      0.14475    0.04158   3.481 0.000500 ***
> blRoeiNaup   0.11404    0.05684   2.006 0.044818 *
> blMoskr     -0.25836    0.11832  -2.184 0.028993 *
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>
> Family: gaussian  ( identity )
> Formula:          blWapit ~ pTDOC + tCa + logtMn + lnOType + logbS500 + bTemp +
>     blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)
> Data: AllData
>
>      AIC      BIC   logLik deviance df.resid
>   2068.2   2140.6  -1020.1   2040.2     1287
>
> Random effects:
>
> Conditional model:
>  Groups   Name        Variance Std.Dev.
>  lNr:vNr  (Intercept) 0.4850   0.6964
>  vNr      (Intercept) 0.4001   0.6326
>  Residual             0.1794   0.4236
> Number of obs: 1301, groups:  lNr:vNr, 175; vNr, 34
>
> Dispersion estimate for gaussian family (sigma^2): 0.179
>
> Conditional model:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept) -1.566112   0.487738  -3.211  0.00132 **
> pTDOC        0.298345   0.070836   4.212 2.53e-05 ***
> tCa          0.013963   0.004579   3.050  0.00229 **
> logtMn       0.243523   0.135480   1.797  0.07226 .
> lnOTypeland  0.390505   0.152519   2.560  0.01046 *
> lnOTypestad  0.112042   0.231978   0.483  0.62911
> logbS500     0.127756   0.029836   4.282 1.85e-05 ***
> bTemp        0.007290   0.003264   2.234  0.02551 *
> blWavlo      0.036470   0.011718   3.112  0.00186 **
> blRoeiNaup   0.042573   0.015790   2.696  0.00701 **
> blMoskr     -0.066737   0.030350  -2.199  0.02788 *
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> 
> 
> Many thanks in advance for your help!
> 
> Kind regards,
> 
> Hein van Lieverloo
> 
> 
> 
> 
> _______________________________________________
> R-sig-mixed-models using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

———————————
Mollie E. Brooks, Ph.D.
Research Scientist
National Institute of Aquatic Resources
Technical University of Denmark
