[R-sig-ME] Multilevel zero-inflated models: model selection and model assumption checks

Tue Apr 28 01:58:53 CEST 2020

Thank you in advance for reading my long-winded email! I appreciate any
help and guidance.

I am working on research that involves frequency of events (summed during
an hour) and duration of events (calculated a seconds during an hour). Both
data types have 0s and are positively skewed. The observations (n =
14,000+) are collected hourly for up to 2 weeks for 500 participants. A
simple version of the model is: outcome ~ explanatory +
(1|participant/date).

For the frequency data, I have fit:

   - (Hurdle) Poisson / Negative Binomial (the var > mean)
   - (Hurdle) Gamma (log link) (assuming okay to treat frequency data as
   continuous)

The AIC is best for the hurdle negative binomial model. I understand that
for AIC comparisons, the Poisson models do not get an extra scale
parameter, while the other models. However, the difference in AICs are in
the hundreds, so I imagine the small difference in k parameters would not
change interpretation, as long as the log-likelihood functions are
similar...

(Q1) Does the above sound reasonable?

For the duration data, I have fit:

   - (Hurdle) Gamma (log link)
   - Zero-Inflated Beta (after transforming duration by SECONDS/3600. I
   decided to try Beta because the hourly data is bounded at 3600 seconds and
   I was not sure if an upper bound affects the gamma distribution).

(Q2) Does an upper bound mean make it inappropriate to fit a Gamma
distribution?

N.B. If fitting regular gamma/beta, I first transformed data to remove 0s.
For both the regular and zero-inflated beta, I shrunk the 1s using EITHER
the algorithm here: https://www.ncbi.nlm.nih.gov/pubmed/16594767 OR the
inverse hyperbolic sine transformation (IHST)). I had thought about using
ZOIB, but I did not want to use BRMS or GAMLSS (since I am a beginner).

Because I transformed the data in different ways for the different models,
I understand that I cannot compare the AICs. Therefore, I thought about
using cross fold validation. However, the package I was recommended to use
(cvms https://github.com/LudvigOlsen/cvms) says it supports lmer/glmer, but
doesn't appear to support glmmTMB, which is how I fit the above models.

I thought a potential solution might be to fit two separates models
(binomial for 0/1 data and count/continuous for positive data, as suggested
in this post: assessing glmmTMB hurdle model fit using DHARMa scaled
residual plot
<https://stats.stackexchange.com/questions/400147/assessing-glmmtmb-hurdle-model-fit-using-dharma-scaled-residual-plot>).
In this way, I could fit lmer/glmer, and perform cross fold validation
using cvms on the positive data. But I'm not sure if this makes sense (to
partition out the 0 data, and only use the positive data to validate the
models)? I suppose I could simulate data to check predictive accuracy on
the glmmTMB models? Is this a reasonable method of comparing the models?

(Q3) How would you recommend comparing the models with different
distributions and family transformations?

Additionally, I considered fitting a Tweedie distribution to both the
frequency and duration. However, I was having trouble finding guidance on
how to interpret the beta coefficients.

(Q4) How does one interpret Tweedie beta coefficients? Do you exponentiate
and discuss as rates?

Finally, I have used the performance package to check the model
assumptions. The results can be found at this link:

<
https://docs.google.com/document/d/1s6qtSAvw297F_cvblAq7Gh2fWgFmz9I5Nxi1K2gjScI/edit?usp=sharing
>

Pic 1 and 2 are both for the Hurdle Gamma on Duration, but have different
explanatory variables. The pics are pretty similar to the results for the
other models/distributions. I have been reading conflicting advice as to
whether the residuals for GLM models need to be normally distributed, etc.
I understand for large datasets, non-normality may be a non-issue even for
linear regression. We are not concerned with predicting new data, but
explaining the data that we have. Therefore, given the size of my sample
and the distributions. how concerned should I be about the QQ plots and
homogeneity of variance plots below?

Thank you very much!

	[[alternative HTML version deleted]]