[R-sig-ME] Overdispersed and zero-inflated - or not - and if so, how to model them? #glmmTMB
Mollie Brooks
mo|||eebrook@ @end|ng |rom gm@||@com
Thu Mar 14 16:34:20 CET 2019
Dear Hein,
See replies below...
> On 14Mar 2019, at 15:46, Hein van Lieverloo <hein.van.lieverloo using viaeterna.nl> wrote:
>
> Dear all,
>
> Keywords: #glmmTMB #overdisp #zero_count
>
> I am grateful for this mailing list and in advance, for any helpful
> response.
> This e-mail has two related questions.
> Details (summary, background, approach and results) are given below them.
>
> Question 1: my data are zero-inflated and overdispersed, but what does the
> overdispersion parameter in glmmTMB (genpois, negbin1, negbin2) tell me?
> It is very high in genpois and negbin1 models (see question 2) and I
> thought it should be near 1, like in negbin2 (>> 1 is overdispersed, <<1 is
> underdispersed)
> But when I test these generalized models for overdispersion
> (overdisp from sjstats), no overdispersion is indicated.
The dispersion parameter in a glmmTMB model is there to handle the dispersion and it’s fine if it’s different from 1. So your tests with sjstats seemed to be correct. For descriptions of how the dispersion parameters relate to the variance, see ?sigma.glmmTMB
>
> Question 2: should I use Gaussian on log(counts) with AIC 2068 or use
> negbin2 with AIC 8036 and add overdispersion and zero-inflation models to
> get a lower AIC (and if so, how?)
> When I use glmmTMB on counts with poisson, I get an AIC of 117 856.
> Testing the model with overdisp and zero_count (from the sjstats package), I
> find p = 0 (overdispersed) and zc-ratio 0.81 (probable zero-inflation).
> When I use glmmTMB on log10(counts), with 0's estimated to 0.1 so
> resulting in -1, I get an AIC of 2068 (with lmer: 2122). Looks fine, but
> may be wrong.
> When I use glmmTMB on counts with either genpois (dispersion par
> 613), negbinom1 (dispersion par 287) or negbinom2 (dispersion par 0.72), I
> get AIC's over 8036. Much higher, but may be ok.
You can’t compare the models of the log-transformed data to the raw data. For example,
> set.seed(1)
> x=rpois(100, lambda=5)
> AIC(glmmTMB(log(x)~1))
[1] 128.0742
> AIC(glmmTMB(x~1, family=poisson))
[1] 422.911
or see discussion here https://stats.stackexchange.com/questions/61332/comparing-aic-of-a-model-and-its-log-transformed-version
>
> My data are zero-inflated and overdispersed and I would think that
> glmmTMB with generalized models would result in much better models (lower
> AIC) than simply working with the log-transformed data.
> The p-values per variable are similar enough, by the way, see the
> best two models at the end of this mail.
> Of course, simply transforming 0 counts into -1 at the log-level
> could be the cause and this approach may oversimplify reality and the AIC of
> 2068 could be artificial.
> If overdispersion and zero-inflation really is necessary, do I need
> to get the AIC down from 8036 to 2068 or can I accept higher AICs? I
> suppose I can.
> But then: how should I approach the development of the zi-model
> and/or the overdispersion models?
> I know, from theory, but the thing is, there is little of no
> research on invertebrates in drinking water distribution systems and their
> structure is so different from surface water systems, that we are developing
> hypotheses from this data set.
>
>
> Summary of design and model
> - Invertebrates in drinking water distribution systems in The Netherlands:
> 1993-1995 (yes, very old data!).
> - glmmTMB of multilevel model (1 | vNr / lNr) : 34 systems (v), 175
> sampling locations (l, ~5/system), 1301 samples (~ 8 quarters from
> 1993-1995), a multitude of variables measured.
> - One of the best model tested: lWapit (count data) ~ pTDOC + tCa + logtMn
> + lnOType + logbS500 + bTemp + blWavlo + blRoeiNaup + blMoskr
>
>
> Background
> The data were collected in the '90's and basic results were published in
> 2012:
> https://www.sciencedirect.com/science/article/abs/pii/S0043135412002217?via%
> 3Dihub
> Dissolved organic matter is the best (causal / proxy / collinear?) predictor
> for energy and carbon supply (R2 ~ 0.6 on mean estimated mean biomass at the
> system level).
> I can send you the paper if you want. Also, I can sent more details, short
> of the data set.
> Since, when I have time (no funding), I try to find more predictors, at more
> than just the highest aggregated level (system). I followed some courses on
> multilevel modeling was well.
> In 2013 a statistician using GenStat told me my data were zero-inflated and
> overdispersed.
> So, no glmm with Poisson response possible. The only option was: first a
> glmm binomial for absence - presence, then glmm Poisson on the
> presence-data.
>
> The past two weeks (finally, I found some time again) I was and am so happy
> to find Ben Bolker's glmmTMB, able to work with zero-inflation and
> overdispersion (I heard of MCMC options in 2017, no time then).
> Learning from Ben Bolker's Salamanders-work, I managed to come a long way,
> but I have not been able to develops stable overdispersion or zero-inflation
> generalized models that significantly lower AIC in glmmTMB.
> Although I teach the basics of statistics and made a lot of LM-models, I am
> not a statistician (I'm a biologist happily forced toward statistics), and I
> find a lot of details and mathematics hard to grasp:
> https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html
>
I’m glad that glmmTMB is solving some of your long-standing problems and I agree that Ben Bolker has contributed immensely to glmmTMB and GLMMs in general, but calling it "Ben Bolker’s glmmTMB" is disregarding the other developers of the package and documentation.
>
>
> Model and comparison approach
> • System-level variable names start with p, location-level variables start
> with l, t or log t, sample-level variables start with b or logb. Only
> lnOType is a three types factor (bl are log-counts of other taxa)
> • tCa and tMn = calcium and manganese in tap water (mean over time), lnOType
> = village, city or rural environment, lbS500 = sediment > 500 um per sample,
> bTemp = temperature sample
> • blWavlo, blRoeipNaup, blMoskr = log(count(taxon; -1 <- 0)) per sample for
> Cladocera, Copepoda, Ostracoda
> • 0-model contains no parameters (response ~ 1), 1-model contains major
> predictor (pTDOC), full model contains 21 likely/possible predictors
> • Model is kept identical in all regressions, although other versions may
> have lower AIC
> • Model data for comparison = all data (during model development, systems
> were randomly split approx. 60-40)
Make sure you’re using the same data for all models in AIC comparisons.
> • I did not include overdispersion or zero-inflated models yet, as I am not
> sure whether it is necessary and I cannot get the basic ones (e.g. just with
> pTDOC) stable. I can imagine that adding empty ZI-models is not very
> effective in countering zero-inflation
>
For people in your situation, I typically recommend fitting a negative binomial model (see Warton, Environmetrics 2005), then testing for zero-inflation (I typically use DHARMa, but it sounds like sjstats does this also). Then if you have zero-inflation, you could fit a zero-inflated negative binomial. Then if the nbinom2 dispersion parameter in the conditional model gets very large, it means you might as well use a zero-inflated Poisson (see nbinom2 in ?sigma.glmmTMB for the reason). However, the best distribution could change depending on the predictors in the model because a model that explains less of the variance might have more dispersion. As you saw in the salamander examples (Brooks et al. 2017, R Journal, Appendix A), you can try different zero-inflation models.
cheers,
Mollie
>
> Results (I can send more details, if required)
>
> AIC per model (dispersion only for best model: x-model)
>
> multilevel model: + (1|vNr / lNr) for all except lm
>
>
>
> response = blWapit = log(count(bWapi)), where -1 <- 0) (counts expressed
> per m3)
>
> lm 0-model 1-model x-model full model
> Gaussian 4293.4 4014.5 3778.2 3642.9
>
>
>
> lmer 0-model 1-model x-model full model
> Gaussian 2185.8 2122 2121.9 2185.8
>
>
>
> glmmTMB 0-model 1-model x-model full model
> Gaussian 2128.7 2116.6 2068.2 2074
>
>
>
> response = b4Wapit = count(bWapi) expressed as rounded per 4 m3 (most sample
> volumes are very close to that)
>
> glmmTMB Disp ratio (p) Dispersion par zc ratio zi-model
> 0-model 1-model x-model full model remarks
> poisson 99.4 (0) * NA 0.81 ** NA
> 137165 137157 117856 114773 * p (H0: not
> overdispersed) **zero-inflation probable
> genpois 0.34 (1) 613 NA NA
> 8096.8 8088.1 8036.1 8042.7
> genpois (+ZI) NA 603 NA zi =~ 1
> 8094.1 8085.5 8036.6 8043.1
> trunc genpois NA 701 (1-model) NA zi =~ 1
> 9109.7 9097.5 * * *with zi =
> ~1 or zi =~pTDOC, non-positive-definite Hessian matrix
> nbinom1 0.53 (1) 287 NA NA
> 8306.4 * 8297.7 * 8244.6 * 8251.8 * * warnings:
> In f(par, order = order, ...) : value out of range in 'lgamma'
> nbinom1 (+ZI) NA 287 NA zi =~ 1
> 8306.4 8299.7 8246.6 * 8253.7 * warnings:
> In f(par, order = order, ...) : value out of range in 'lgamma'
> nbinom2 0.78 0.72 NA NA
> 8224.1 8216.0 8165.3 8171.8
> nbinom2 (+ZI) NA 0.787 NA zi =~ 1
> 8226.1 8218.0 8165.3 8172.6
>
>
> Comparing the best generalized glmmTMB model (nbinom2) on counts with the
> best Gaussian model on log10(counts, 0 -> -1)
>
> Family: nbinom2 ( log )
> Family: gaussian ( identity )
> Formula: b4Wapit ~ pTDOC + tCa + logtMn + lnOType + logbS500 +
> bTemp + Formula: blWapit ~ pTDOC + tCa + logtMn + lnOType +
> logbS500 + bTemp +
> blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)
> blWavlo + blRoeiNaup + blMoskr + (1 | vNr/lNr)
>
> Data: AllData
> Data: AllData
>
>
> AIC BIC logLik deviance df.resid
> AIC BIC logLik deviance df.resid
>
> 8165.3 8237.7 -4068.6 8137.3 1287
> 2068.2 2140.6 -1020.1 2040.2 1287
>
>
>
> Random effects:
> Random effects:
>
>
> Conditional model:
> Conditional model:
> Groups Name Variance Std.Dev.
> Groups Name Variance Std.Dev.
> lNr:vNr (Intercept) 4.325 2.080
> lNr:vNr (Intercept) 0.4850 0.6964
> vNr (Intercept) 5.913 2.432
> vNr (Intercept) 0.4001 0.6326
>
> Residual 0.1794 0.4236
> Number of obs: 1301, groups: lNr:vNr, 175; vNr, 34
> Number of obs: 1301, groups: lNr:vNr, 175; vNr, 34
>
>
>
> Overdispersion parameter for nbinom2 family (): 0.72
> Dispersion estimate for gaussian family (sigma^2): 0.179
>
>
>
> Conditional model:
> Conditional model:
> Estimate Std. Error z value Pr(>|z|)
> Estimate Std. Error z value Pr(>|z|)
>
> (Intercept) -5.25438 1.80820 -2.906 0.003662 **
> (Intercept) -1.566112 0.487738 -3.211 0.00132 **
>
> pTDOC 0.95740 0.26583 3.602 0.000316 ***
> pTDOC 0.298345 0.070836 4.212 2.53e-05 ***
>
> tCa 0.06371 0.01623 3.926 8.64e-05 ***
> tCa 0.013963 0.004579 3.050 0.00229 **
>
> logtMn 0.83018 0.49447 1.679 0.093164 .
> logtMn 0.243523 0.135480 1.797 0.07226 .
>
> lnOTypeland 0.97151 0.51131 1.900 0.057425 .
> lnOTypeland 0.390505 0.152519 2.560 0.01046 *
>
> lnOTypestad -0.72832 0.82751 -0.880 0.378788
> lnOTypestad 0.112042 0.231978 0.483 0.62911
>
> logbS500 0.44416 0.10870 4.086 4.39e-05 ***
> logbS500 0.127756 0.029836 4.282 1.85e-05 ***
>
> bTemp 0.03655 0.01301 2.810 0.004948 **
> bTemp 0.007290 0.003264 2.234 0.02551 *
>
> blWavlo 0.14475 0.04158 3.481 0.000500 ***
> blWavlo 0.036470 0.011718 3.112 0.00186 **
>
> blRoeiNaup 0.11404 0.05684 2.006 0.044818 *
> blRoeiNaup 0.042573 0.015790 2.696 0.00701 **
>
> blMoskr -0.25836 0.11832 -2.184 0.028993 *
> blMoskr -0.066737 0.030350 -2.199 0.02788 *
>
> ---
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>
>
> Many thanks in advance for your help!
>
> Kind regards,
>
> Hein van Lieverloo
>
>
>
> Met vriendelijke groet,
>
> Hein van Lieverloo
>
> _______________________________________________
> R-sig-mixed-models using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
———————————
Mollie E. Brooks, Ph.D.
Research Scientist
National Institute of Aquatic Resources
Technical University of Denmark
[[alternative HTML version deleted]]
More information about the R-sig-mixed-models
mailing list