[R-sig-ME] Overdispersed and zero-inflated - or not - and if so, how to model them? #glmmTMB

Thu Mar 14 19:16:47 CET 2019

Dear Mollie,

- Thank you so, so much for your great and swift reply!

- I understand now the meaning of the dispersion parameters in glmmTMB. 

- DHARMa is a great tool box, with testZeroInflation(simulateResiduals(model)) even better than zero_count from sjstats, as it also works with non-Poisson models and provides great simulation frequency distributions.

- After using this tool, it seems I do not have to use ZI-models, nor dispersion models when I'm working with the genpois model in glmmTMB. This makes life so much easier!

- I will study the information you sent on AIC-comparison between counts and log(counts). For now, I can accept the much higher AIC (8036) with the glmmTMB-genpois on counts compared to the 2068 with glmmTMB-gaussian with log10(counts, 1 <- 0)).

- Sorry to everyone in the glmmTMB development team for thinking Ben Bolker developed glmmTMB, I should have studied the package details. Especially as I saw Mollie's name in the Description file just now. I do not there is a plain text (nor the HTML) code for the 'foot in mouth'-emoji, but I assure you it is here ^1000 ....

PS I responded to Mollie earlier in more detail. I don't know the mailing etiquette and I am happy to send this more detailed mail in this group.

Kind regards,

Hein 

From: Mollie Brooks <mollieebrooks using gmail.com> 
Sent: donderdag 14 maart 2019 16:34
To: Hein van Lieverloo <hein.van.lieverloo using viaeterna.nl>
Cc: R-sig-mixed-models using r-project.org
Subject: Re: [R-sig-ME] Overdispersed and zero-inflated - or not - and if so, how to model them? #glmmTMB

Dear Hein,

See replies below...

On 14Mar 2019, at 15:46, Hein van Lieverloo < <mailto:hein.van.lieverloo using viaeterna.nl> hein.van.lieverloo using viaeterna.nl> wrote:

Dear all,

Keywords: #glmmTMB  #overdisp  #zero_count

I am grateful for this mailing list and in advance, for any helpful
response.
This e-mail has two related questions.
Details (summary, background, approach and results) are given below them.

Question 1: my data are zero-inflated and overdispersed, but what does the
overdispersion parameter in glmmTMB (genpois, negbin1, negbin2) tell me? 
               It is very high in genpois and negbin1 models (see question 2) and I
thought it should be near 1, like in negbin2 (>> 1 is overdispersed, <<1 is
underdispersed)
               But when I test these generalized models for overdispersion
(overdisp from sjstats), no overdispersion is indicated.

The dispersion parameter in a glmmTMB model is there to handle the dispersion and it’s fine if it’s different from 1. So your tests with sjstats seemed to be correct. For descriptions of how the dispersion parameters relate to the variance, see ?sigma.glmmTMB

Question 2: should I use Gaussian on log(counts) with AIC 2068  or use
negbin2 with AIC 8036 and add overdispersion and zero-inflation models to
get a lower AIC (and if so, how?)
               When I use glmmTMB on counts with poisson, I get an AIC of 117 856.
Testing the model with overdisp and zero_count (from the sjstats package), I
find p = 0 (overdispersed) and zc-ratio 0.81 (probable zero-inflation).
               When I use glmmTMB on log10(counts), with 0's estimated to 0.1 so
resulting in -1, I get an AIC of 2068  (with lmer: 2122). Looks fine, but
may be wrong.
               When I use glmmTMB on counts with either genpois (dispersion par
613), negbinom1 (dispersion par 287) or negbinom2 (dispersion par 0.72), I
get AIC's over 8036. Much higher, but may be ok.

You can’t compare the models of the log-transformed data to the raw data. For example,

> set.seed(1)

> x=rpois(100, lambda=5)

> AIC(glmmTMB(log(x)~1))

[1] 128.0742

> AIC(glmmTMB(x~1, family=poisson))

[1] 422.911

or see discussion here  <https://stats.stackexchange.com/questions/61332/comparing-aic-of-a-model-and-its-log-transformed-version> https://stats.stackexchange.com/questions/61332/comparing-aic-of-a-model-and-its-log-transformed-version

               My data are zero-inflated and overdispersed and I would think that
glmmTMB with generalized models would result in much better models (lower
AIC) than simply working with the log-transformed data.
               The p-values per variable are similar enough, by the way, see the
best two models at the end of this mail.
               Of course, simply transforming 0 counts into -1 at the log-level
could be the cause and this approach may oversimplify reality and the AIC of
2068 could be artificial.
               If overdispersion and zero-inflation really is necessary, do I need
to get the AIC  down from 8036 to 2068 or can I accept higher AICs? I
suppose I can.
               But then: how should I approach the development of the zi-model
and/or the overdispersion models? 
               I know, from theory, but the thing is, there is little of no
research on invertebrates in drinking water distribution systems and their
structure is so different from surface water systems, that we are developing
hypotheses from this data set.

Summary of design and model
- Invertebrates in drinking water distribution systems in The Netherlands:
1993-1995 (yes, very old data!).
- glmmTMB of multilevel model  (1 | vNr / lNr)  : 34 systems (v), 175
sampling locations (l, ~5/system), 1301 samples (~ 8 quarters from
1993-1995), a multitude of variables measured.
- One of the best model tested: lWapit (count data)  ~ pTDOC + tCa + logtMn
+ lnOType + logbS500 + bTemp + blWavlo + blRoeiNaup + blMoskr

Background
The data were collected in the '90's and basic results were published in
2012:
 <https://www.sciencedirect.com/science/article/abs/pii/S0043135412002217?via%25> https://www.sciencedirect.com/science/article/abs/pii/S0043135412002217?via%
3Dihub
Dissolved organic matter is the best (causal / proxy / collinear?) predictor
for energy and carbon supply (R2 ~ 0.6 on mean estimated mean biomass at the
system level).  
I can send you the paper if you want. Also, I can sent more details, short
of the data set.
Since, when I have time (no funding), I try to find more predictors, at more
than just the highest aggregated level (system). I followed some courses on
multilevel modeling was well.
In 2013 a statistician using GenStat told me my data were zero-inflated and
overdispersed.
So, no glmm with Poisson response possible. The only option was: first a
glmm binomial for absence - presence, then glmm Poisson on the
presence-data.

The past two weeks (finally, I found some time again) I was and am so happy
to find Ben Bolker's  glmmTMB, able to work with zero-inflation and
overdispersion (I heard of MCMC options in 2017, no time then).
Learning from Ben Bolker's Salamanders-work, I managed to come a long way,
but I have not been able to develops stable overdispersion or zero-inflation
generalized models that significantly lower AIC in glmmTMB.
Although I teach the basics of statistics and made a lot of LM-models, I am
not a statistician (I'm a biologist happily forced toward statistics), and I
find a lot of details and mathematics hard to grasp:
 <https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html> https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html

I’m glad that glmmTMB is solving some of your long-standing problems and I agree that Ben Bolker has contributed immensely to glmmTMB and GLMMs in general, but calling it "Ben Bolker’s glmmTMB" is disregarding the other developers of the package and documentation.

Model and comparison approach
• System-level variable names start with p, location-level variables start
with l, t or log t, sample-level variables start with b or logb. Only
lnOType is a three types factor (bl are log-counts of other taxa)
• tCa and tMn = calcium and manganese in tap water (mean over time), lnOType
= village, city or rural environment, lbS500 = sediment > 500 um per sample,
bTemp = temperature sample
• blWavlo, blRoeipNaup, blMoskr = log(count(taxon; -1 <- 0)) per sample for
Cladocera, Copepoda, Ostracoda 
• 0-model contains no parameters (response ~ 1), 1-model contains major
predictor (pTDOC), full model contains 21 likely/possible predictors
• Model is kept identical in all regressions, although other versions may
have lower AIC
• Model data for comparison = all data  (during model development, systems
were randomly split approx. 60-40)

Make sure you’re using the same data for all models in AIC comparisons. 

• I did not include overdispersion or zero-inflated models yet, as I am not
sure whether it is necessary and I cannot get the basic ones (e.g. just with
pTDOC) stable. I can imagine that adding empty ZI-models is not very
effective in countering zero-inflation

For people in your situation, I typically recommend fitting a negative binomial model (see Warton, Environmetrics 2005), then testing for zero-inflation (I typically use DHARMa, but it sounds like sjstats does this also). Then if you have zero-inflation, you could fit a zero-inflated negative binomial. Then if the nbinom2 dispersion parameter in the conditional model gets very large, it means you might as well use a zero-inflated Poisson (see nbinom2 in ?sigma.glmmTMB for the reason). However, the best distribution could change depending on the predictors in the model because a model that explains less of the variance might have more dispersion. As you saw in the salamander examples (Brooks et al. 2017, R Journal, Appendix A), you can try different zero-inflation models.

cheers,

Mollie

Results (I can send more details, if required)

AIC per model (dispersion only for best model: x-model)

multilevel model:  + (1|vNr / lNr) for all except lm

response = blWapit = log(count(bWapi)), where -1 <- 0)  (counts expressed
per m3)

lm                         0-model              1-model              x-model               full model
Gaussian             4293.4                 4014.5                 3778.2                 3642.9

lmer                     0-model              1-model              x-model               full model
Gaussian             2185.8                 2122                     2121.9                 2185.8

glmmTMB           0-model              1-model              x-model               full model
Gaussian             2128.7                2116.6                 2068.2                 2074

response = b4Wapit = count(bWapi) expressed as rounded per 4 m3 (most sample
volumes are very close to that)

glmmTMB           Disp ratio (p)      Dispersion par   zc ratio                zi-model
0-model              1-model              x-model               full model           remarks
poisson               99.4 (0) *            NA                        0.81 **                NA
137165                137157                117856                114773                * p (H0: not
overdispersed) **zero-inflation probable
genpois                              0.34 (1)                613                       NA                        NA
8096.8                 8088.1                 8036.1                 8042.7  
genpois (+ZI)      NA                        603                       NA                        zi =~ 1
8094.1                 8085.5                 8036.6                 8043.1  
trunc genpois    NA                        701 (1-model)    NA                        zi =~ 1
9109.7                 9097.5                 *                           *                           *with zi =
~1 or zi =~pTDOC, non-positive-definite Hessian matrix
nbinom1             0.53 (1)                287                       NA                        NA
8306.4 *             8297.7 *             8244.6 *             8251.8 *             * warnings:
In f(par, order = order, ...) : value out of range in 'lgamma'
nbinom1 (+ZI)    NA                        287                       NA                        zi =~ 1
8306.4                 8299.7                 8246.6 *             8253.7                 * warnings:
In f(par, order = order, ...) : value out of range in 'lgamma'
nbinom2             0.78                      0.72                      NA                        NA
8224.1                 8216.0                 8165.3                 8171.8  
nbinom2 (+ZI)    NA                        0.787                   NA                        zi =~ 1
8226.1                 8218.0                 8165.3                 8172.6  

Comparing the best generalized glmmTMB model (nbinom2) on counts with the
best Gaussian model on log10(counts, 0 -> -1)

Family: nbinom2  ( log )
Family: gaussian  ( identity )                                                                  
Formula:          b4Wapit ~ pTDOC + tCa + logtMn + lnOType + logbS500 +
bTemp +             Formula:          blWapit ~ pTDOC + tCa + logtMn + lnOType +
logbS500 + bTemp +                                                                 
   blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)
blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)

Data: AllData
Data: AllData                                                                

    AIC      BIC   logLik deviance df.resid
AIC      BIC   logLik deviance df.resid

 8165.3   8237.7  -4068.6   8137.3     1287
2068.2   2140.6  -1020.1   2040.2     1287

Random effects:
Random effects:                                                                        

Conditional model:
Conditional model:                                                                    
Groups  Name        Variance Std.Dev.
Groups   Name        Variance Std.Dev.                                                                
lNr:vNr (Intercept) 4.325    2.080
lNr:vNr  (Intercept) 0.4850   0.6964                                                                   
vNr     (Intercept) 5.913    2.432
vNr      (Intercept) 0.4001   0.6326                                                                      

Residual             0.1794   0.4236                                                                          
Number of obs: 1301, groups:  lNr:vNr, 175; vNr, 34
Number of obs: 1301, groups:  lNr:vNr, 175; vNr, 34

Overdispersion parameter for nbinom2 family (): 0.72
Dispersion estimate for gaussian family (sigma^2): 0.179

Conditional model:
Conditional model:                                                                    
               Estimate Std. Error z value Pr(>|z|)
Estimate Std. Error z value Pr(>|z|)

(Intercept) -5.25438    1.80820  -2.906 0.003662 **
(Intercept) -1.566112   0.487738  -3.211  0.00132 **

pTDOC        0.95740    0.26583   3.602 0.000316 ***
pTDOC        0.298345   0.070836   4.212 2.53e-05 ***

tCa          0.06371    0.01623   3.926 8.64e-05 ***
tCa          0.013963   0.004579   3.050  0.00229 **

logtMn       0.83018    0.49447   1.679 0.093164 .
logtMn       0.243523   0.135480   1.797  0.07226 .

lnOTypeland  0.97151    0.51131   1.900 0.057425 .
lnOTypeland  0.390505   0.152519   2.560  0.01046 *

lnOTypestad -0.72832    0.82751  -0.880 0.378788
lnOTypestad  0.112042   0.231978   0.483  0.62911

logbS500     0.44416    0.10870   4.086 4.39e-05 ***
logbS500     0.127756   0.029836   4.282 1.85e-05 ***

bTemp        0.03655    0.01301   2.810 0.004948 **
bTemp        0.007290   0.003264   2.234  0.02551 *

blWavlo      0.14475    0.04158   3.481 0.000500 ***
blWavlo      0.036470   0.011718   3.112  0.00186 **

blRoeiNaup   0.11404    0.05684   2.006 0.044818 *
blRoeiNaup   0.042573   0.015790   2.696  0.00701 **

blMoskr     -0.25836    0.11832  -2.184 0.028993 *
blMoskr     -0.066737   0.030350  -2.199  0.02788 *

---
---                         
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Many thanks in advance for your help!

Kind regards,

Hein van Lieverloo

Met vriendelijke groet,

Hein van Lieverloo

_______________________________________________
R-sig-mixed-models using r-project.org <mailto:R-sig-mixed-models using r-project.org>  mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

———————————
Mollie E. Brooks, Ph.D.
Research Scientist
National Institute of Aquatic Resources
Technical University of Denmark

	[[alternative HTML version deleted]]