[R-sig-ME] Model specification/family for a continuous/proportional, response with many zeros

Thu May 20 11:08:33 CEST 2021

Dear Mike,

Your code is completely unreadable. Did you send it in plain text? Sending
HTML email tends to mangle code.

I check the distribution by running a large number of simulations to get
the range of potential values. Then I divide the observed density (from the
data) by the expected density (from the simulations). See
https://inlatools.netlify.app/articles/distribution.html#distribution-checks-1
for an example. This is based on INLA models and doesn't handle beta
distributions (yet). But you can get an idea of how you can do this
yourself.

Best regards,

Thierry

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx using inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>

Op do 20 mei 2021 om 10:46 schreef Michael Lawson via R-sig-mixed-models <
r-sig-mixed-models using r-project.org>:

> Hi again,
>
> I wanted to look a little more at the rest of the simulated values
> (non-zeros), so I took the mean proportion of simulated values in set bins
> of 0.01 and compared this to the real values.
>
>
> *bins <- seq(from = 0, to = 1, by = 0.01)*
>
>
>
>
>
>
>
>
>
>
>
>
> *sim_seq <- c()for(i in bins){  a <- mean(unlist(lapply(sim_beta_glmm,
> function(x){ length(which(x>i & x <= i+0.01))/length(x)}), use.names =
> FALSE))  sim_seq <- c(sim_seq,a)}real_seq <- c()for(i in bins){  a <-
> print(mean(unlist(lapply(glmm_zone_data$prop_time, function(x){
> length(which(x>i & x <= i+0.01))/length(x)}), use.names = FALSE)))
> real_seq <- c(real_seq,a)}*
>
>
> *barplot(rbind(sim_seq,real_seq),col=c("green","red"),beside = TRUE,
> legend.text = c("simulated","real"))*
>
>
> *cbind(bins,sim_seq,real_seq)*
>
>
>
>
>
>
>
>
>
>
>
> *bins         sim_seq    real_seq  [1,] 0.00 0.2199795081967 0.232240437
> [2,] 0.01 0.1021612021858 0.166666667  [3,] 0.02 0.0750964480874
> 0.103825137  [4,] 0.03 0.0592128415301 0.073770492  [5,] 0.04
> 0.0478316939891 0.038251366  [6,] 0.05 0.0397267759563 0.030054645  [7,]
> 0.06 0.0331961748634 0.016393443  [8,] 0.07 0.0279322404372 0.013661202
> [9,] 0.08 0.0238013661202 0.024590164 [10,] 0.09 0.0201101092896
> 0.010928962*
>
> As you can see in the output above (and the plot attached), the real data
> is quite a bit more right-skewed compared to the simulated values. Does
> this look like a good enough fit or will I have to try a different model?
>
> Many thanks,
> Mike
>
> On Wed, 19 May 2021 at 12:38, Michael Lawson <mrml500 using york.ac.uk> wrote:
>
> > Dear Alain,
> >
> > Thank you for the suggestion. I tried simulating the 0's as you suggested
> > in the following way. The output is reassuring - with the actual number
> of
> > zeros in the middle of the distribution.
> >
> > sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
> > sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
> > length(which(x==0))/length(x)}), use.names = FALSE)
> > hist(sim_zeros, breaks = c(100))
> > abline(v = plogis(-1.1804), col = "red")
> >
> > All the best,
> > Mike
> >
> >
> > On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via
> > R-sig-mixed-models <r-sig-mixed-models using r-project.org> wrote:
> >
> >>
> >> > Today's Topics:
> >> >
> >> >     1. Re:  Model specification/family for a continuous/proportional
> >> >        response with many zeros (Michael Lawson)
> >> >
> >> > ----------------------------------------------------------------------
> >> >
> >> > Message: 1
> >> > Date: Wed, 19 May 2021 09:31:46 +0100
> >> > From: Michael Lawson <mrml500 using york.ac.uk>
> >> > To: Thierry Onkelinx <thierry.onkelinx using inbo.be>
> >> > Cc: r-sig-mixed-models <r-sig-mixed-models using r-project.org>
> >> > Subject: Re: [R-sig-ME]  Model specification/family for a
> >> >       continuous/proportional response with many zeros
> >> > Message-ID:
> >> >       <
> >> CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw using mail.gmail.com>
> >> > Content-Type: text/plain; charset="utf-8"
> >> >
> >> > Hi Thierry,
> >> >
> >> > I understood after your previous message that a high dispersion
> >> parameter
> >> > in beta models does not signify overdispersion. Both answers addressed
> >> this
> >> > misconception in those two links I provided.
> >> >
> >> > I suppose the real answer I am now looking for is -  how do I assess
> the
> >> > validity and fit of the model? Are there specifics or assumptions I
> must
> >> > take into account with zero-inflated beta glmms?
> >>
> >> Just simulate 1000 data sets from your model, and see whether the
> >> simulated data sets are comparable to your original data. One of the
> >> things you can look at is whether the simulated data sets contain
> >> similar number of zeros as in the observed data. I'm not sure whether
> >> the DHARMA package can do this for you. If not..it is easy to program.
> >>
> >> See also Figure 8 in (sorry for self-citing):
> >>
> >>
> >>
> https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577
> >>
> >> Alain
> >>
> >>
> >> > Thanks for your help,
> >> > Mike
> >> >
> >> > On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <
> >> thierry.onkelinx using inbo.be>
> >> > wrote:
> >> >
> >> >> Dear Mike,
> >> >>
> >> >> I think you misread my reply. I never stated that there's something
> >> wrong
> >> >> with the model. The only "problem" I highlighted was your
> misconception
> >> >> about the "high overdispersion". In this case, a high parameter value
> >> >> indicates a low variance, which is what we mostly want to see.
> >> >>
> >> >> Best regards,
> >> >>
> >> >> ir. Thierry Onkelinx
> >> >> Statisticus / Statistician
> >> >>
> >> >> Vlaamse Overheid / Government of Flanders
> >> >> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
> NATURE
> >> AND
> >> >> FOREST
> >> >> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
> >> >> thierry.onkelinx using inbo.be
> >> >> Havenlaan 88 bus 73, 1000 Brussel
> >> >> www.inbo.be
> >> >>
> >> >>
> >> >>
> >>
> ///////////////////////////////////////////////////////////////////////////////////////////
> >> >> To call in the statistician after the experiment is done may be no
> more
> >> >> than asking him to perform a post-mortem examination: he may be able
> >> to say
> >> >> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> >> >> The plural of anecdote is not data. ~ Roger Brinner
> >> >> The combination of some data and an aching desire for an answer does
> >> not
> >> >> ensure that a reasonable answer can be extracted from a given body of
> >> data.
> >> >> ~ John Tukey
> >> >>
> >> >>
> >>
> ///////////////////////////////////////////////////////////////////////////////////////////
> >> >>
> >> >> <https://www.inbo.be>
> >> >>
> >> >>
> >> >> Op di 18 mei 2021 om 12:49 schreef Michael Lawson <
> mrml500 using york.ac.uk
> >> >:
> >> >>
> >> >>> Dear Thierry,
> >> >>>
> >> >>> Thanks for the help. So if the dispersion parameter in this model
> >> doesn't
> >> >>> fit with the beta distribution, are there any alternative approaches
> >> I can
> >> >>> use?
> >> >>>
> >> >>> I can't seem to find much information on this elsewhere other than
> >> these
> >> >>> two threads:
> >> >>> https://stats.stackexchange.com/a/451453/233414
> >> >>> https://stats.stackexchange.com/a/466951/233414
> >> >>>
> >> >>> All the best,
> >> >>> Mike
> >> >>>
> >> >>> On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <
> >> thierry.onkelinx using inbo.be>
> >> >>> wrote:
> >> >>>
> >> >>>> Dear Mike,
> >> >>>>
> >> >>>> The zero-inflation is specified on the logit scale. plogis(-1.18) =
> >> >>>> 0.235 23.5% zero seems reasonable when reading your story. (Didn't
> >> look at
> >> >>>> the data).
> >> >>>>
> >> >>>> You need to look at the definition for the "over"dispersion
> >> parameter.
> >> >>>> For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) /
> >> (\phi + 1)
> >> >>>> (see ?glmmTMB::beta_family) Hence a large value of \phi implies a
> low
> >> >>>> variance.
> >> >>>>
> >> >>>> Best regards,
> >> >>>>
> >> >>>> ir. Thierry Onkelinx
> >> >>>> Statisticus / Statistician
> >> >>>>
> >> >>>> Vlaamse Overheid / Government of Flanders
> >> >>>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
> >> NATURE
> >> >>>> AND FOREST
> >> >>>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality
> Assurance
> >> >>>> thierry.onkelinx using inbo.be
> >> >>>> Havenlaan 88 bus 73, 1000 Brussel
> >> >>>> www.inbo.be
> >> >>>>
> >> >>>>
> >> >>>>
> >>
> ///////////////////////////////////////////////////////////////////////////////////////////
> >> >>>> To call in the statistician after the experiment is done may be no
> >> more
> >> >>>> than asking him to perform a post-mortem examination: he may be
> able
> >> to say
> >> >>>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> >> >>>> The plural of anecdote is not data. ~ Roger Brinner
> >> >>>> The combination of some data and an aching desire for an answer
> does
> >> not
> >> >>>> ensure that a reasonable answer can be extracted from a given body
> >> of data.
> >> >>>> ~ John Tukey
> >> >>>>
> >> >>>>
> >>
> ///////////////////////////////////////////////////////////////////////////////////////////
> >> >>>>
> >> >>>> <https://www.inbo.be>
> >> >>>>
> >> >>>>
> >> >>>> Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <
> >> mrml500 using york.ac.uk>:
> >> >>>>
> >> >>>>> Hi Thierry,
> >> >>>>>
> >> >>>>> Thank you for your advice and speedy response.
> >> >>>>>
> >> >>>>> Most of the data is closer to the lower bound (0). e.g. the mean
> >> time
> >> >>>>> for group A in zone A = 15.1 seconds and group A in zone B = 3.8
> >> seconds.
> >> >>>>> However there are a very small number of outliers near the upper
> >> bound, the
> >> >>>>> largest being 294 out of the 300 seconds (see the attached file if
> >> you want
> >> >>>>> to look at the data).
> >> >>>>>
> >> >>>>> I have taken a stab at running a Zero-inflated Beta GLMM using
> >> glmmTMB
> >> >>>>> in R like so:
> >> >>>>>
> >> >>>>> betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
> >> >>>>>                               family = beta_family(),
> >> >>>>>                               ziformula=~1,
> >> >>>>>                               data = glmm_zone_data)
> >> >>>>>
> >> >>>>> summary(beta_mod)
> >> >>>>>
> >> >>>>> *Family: beta  ( logit )*
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> *Formula:          prop_time ~ group * zone + (1 | id)Zero
> >> inflation:
> >> >>>>>            ~1Data: glmm_zone_data     AIC      BIC   logLik
> deviance
> >> >>>>> df.resid  -763.6   -736.3    388.8   -777.6      359Random
> >> >>>>> effects:Conditional model: Groups Name        Variance  Std.Dev.
> id
> >> >>>>> (Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
> >> >>>>> 14Overdispersion parameter for beta family (): 13.1Conditional
> >> model:
> >> >>>>>              Estimate Std. Error z value Pr(>|z|)    (Intercept)
> >> >>>>>   -2.7685     0.1031 -26.844  < 2e-16 ***groupB
>  -0.4455
> >> >>>>> 0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524
> >> -2.741
> >> >>>>> 0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
> >> >>>>> ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
> >> >>>>> 1Zero-inflation model:            Estimate Std. Error z value
> >> Pr(>|z|)
> >> >>>>>    (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif.
> >> codes:  0
> >> >>>>> ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1*
> >> >>>>>
> >> >>>>> Does this look like the correct way of specifying the model? I am
> a
> >> >>>>> little confused about specifying and interpreting the
> zero-inflation
> >> >>>>> component - I have only just begun reading about this.
> >> >>>>>
> >> >>>>> I noticed that the dispersion parameter is quite high at 13.1. I'm
> >> not
> >> >>>>> sure if this matters for beta models?. I tried running DHARMa
> >> >>>>> simulateResiduals on the model output and got significant
> >> deviations in the
> >> >>>>> dispersion (<2.2e-16) and KS tests. e.g.
> >> >>>>>
> >> >>>>> DHARMa::testDispersion(beta_mod)
> >> >>>>>
> >> >>>>> *DHARMa nonparametric dispersion test via sd of residuals fitted
> vs.
> >> >>>>> simulated*
> >> >>>>>
> >> >>>>> *data:  simulationOutput*
> >> >>>>> *ratioObsSim = 1.3612, p-value < 2.2e-16*
> >> >>>>> *alternative hypothesis: two.sided*
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> Many thanks,
> >> >>>>> Mike
> >> >>>>>
> >> >>>>> On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
> >> >>>>> thierry.onkelinx using inbo.be> wrote:
> >> >>>>>
> >> >>>>>> Dear Michael,
> >> >>>>>>
> >> >>>>>> Your data has bounds (lower bound at 0 and upper bound at 300)
> and
> >> you
> >> >>>>>> have a lot of data close to a boundary. In such a case, a
> >> continuous
> >> >>>>>> distribution which ignores those bound is not a good idea. If the
> >> time
> >> >>>>>> spent outside of both zones is limited, then a long time in zone
> A
> >> excludes
> >> >>>>>> a long time in zone B by definition. Then I'd look towards a
> >> multinomial
> >> >>>>>> distribution. If the time spent outside both zones is dominant,
> >> then you
> >> >>>>>> can use a zero-inflated beta as you suggested. A zero-inflated
> >> gamma might
> >> >>>>>> be OK if the data is not too close to the upper boundary. If you
> >> are
> >> >>>>>> considering zero-inflated beta vs zero-inflated gamma, then you
> >> should
> >> >>>>>> choose zero-inflated beta IMHO.
> >> >>>>>>
> >> >>>>>> Best regards,
> >> >>>>>>
> >> >>>>>> ir. Thierry Onkelinx
> >> >>>>>> Statisticus / Statistician
> >> >>>>>>
> >> >>>>>> Vlaamse Overheid / Government of Flanders
> >> >>>>>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
> >> NATURE
> >> >>>>>> AND FOREST
> >> >>>>>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality
> >> Assurance
> >> >>>>>> thierry.onkelinx using inbo.be
> >> >>>>>> Havenlaan 88 bus 73, 1000 Brussel
> >> >>>>>> www.inbo.be
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >>
> ///////////////////////////////////////////////////////////////////////////////////////////
> >> >>>>>> To call in the statistician after the experiment is done may be
> no
> >> >>>>>> more than asking him to perform a post-mortem examination: he may
> >> be able
> >> >>>>>> to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
> >> >>>>>> The plural of anecdote is not data. ~ Roger Brinner
> >> >>>>>> The combination of some data and an aching desire for an answer
> >> does
> >> >>>>>> not ensure that a reasonable answer can be extracted from a given
> >> body of
> >> >>>>>> data. ~ John Tukey
> >> >>>>>>
> >> >>>>>>
> >>
> ///////////////////////////////////////////////////////////////////////////////////////////
> >> >>>>>>
> >> >>>>>> <https://www.inbo.be>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
> >> >>>>>> R-sig-mixed-models <r-sig-mixed-models using r-project.org>:
> >> >>>>>>
> >> >>>>>>> Hello,
> >> >>>>>>>
> >> >>>>>>> I am new to GLMMs and have a dataset where I have two distinct
> >> groups
> >> >>>>>>> (A
> >> >>>>>>> and B) of 7 individuals each. The data consists of repeated
> >> >>>>>>> measurements of
> >> >>>>>>> each individual where the amount of time spent at either zone_A
> or
> >> >>>>>>> zone_B
> >> >>>>>>> is recorded (out of a total time of 300s/observation period).
> For
> >> >>>>>>> most of
> >> >>>>>>> the time period the individuals are in neither zone.
> >> >>>>>>>
> >> >>>>>>> I want to test if group A and group B spend more time in zone A
> >> >>>>>>> compared to
> >> >>>>>>> zone B (and vice versa).
> >> >>>>>>>
> >> >>>>>>> Speaking to someone else, they said I should use a Binomial GLMM
> >> using
> >> >>>>>>> cbind. i.e.
> >> >>>>>>> cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).
> >> >>>>>>>
> >> >>>>>>> However, the response variable is continuous (albeit with an
> upper
> >> >>>>>>> bound of
> >> >>>>>>> 300 seconds per observation period), so I'm not sure if this is
> >> >>>>>>> appropriate?
> >> >>>>>>>
> >> >>>>>>> Should I convert the response into a proportion and use
> something
> >> >>>>>>> like a
> >> >>>>>>> Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something
> >> like:
> >> >>>>>>> prop_time ~ zone*group + (1|id)
> >> >>>>>>>
> >> >>>>>>> The data is quite heavily right-skewed and contains a lot of
> 0's,
> >> so
> >> >>>>>>> reading around it also looks like I may need to convert these
> >> into a
> >> >>>>>>> zero-inflated/hurdle model?
> >> >>>>>>>
> >> >>>>>>> Thank you for any suggestions,
> >> >>>>>>> Mike
> >> >>>>>>>
> >> >>>>>>>          [[alternative HTML version deleted]]
> >> >>>>>>>
> >> >>>>>>> _______________________________________________
> >> >>>>>>> R-sig-mixed-models using r-project.org mailing list
> >> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >> >>>>>>>
> >> >       [[alternative HTML version deleted]]
> >> >
> >> >
> >> >
> >> >
> >> > ------------------------------
> >> >
> >> > Subject: Digest Footer
> >> >
> >> > _______________________________________________
> >> > R-sig-mixed-models mailing list
> >> > R-sig-mixed-models using r-project.org
> >> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >> >
> >> >
> >> > ------------------------------
> >> >
> >> > End of R-sig-mixed-models Digest, Vol 173, Issue 19
> >> > ***************************************************
> >>
> >> --
> >> Dr. Alain F. Zuur
> >> Highland Statistics Ltd.
> >> 9 St Clair Wynd
> >> AB41 6DZ Newburgh, UK
> >> Email: highstat using highstat.com
> >> URL:   www.highstat.com
> >>
> >> _______________________________________________
> >> R-sig-mixed-models using r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >>
> >
> _______________________________________________
> R-sig-mixed-models using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>

	[[alternative HTML version deleted]]