[R-sig-ME] Model specification/family for a continuous/proportional, response with many zeros

Michael Lawson mrm|500 @end|ng |rom york@@c@uk
Thu May 20 10:46:07 CEST 2021


Hi again,

I wanted to look a little more at the rest of the simulated values
(non-zeros), so I took the mean proportion of simulated values in set bins
of 0.01 and compared this to the real values.


*bins <- seq(from = 0, to = 1, by = 0.01)*












*sim_seq <- c()for(i in bins){  a <- mean(unlist(lapply(sim_beta_glmm,
function(x){ length(which(x>i & x <= i+0.01))/length(x)}), use.names =
FALSE))  sim_seq <- c(sim_seq,a)}real_seq <- c()for(i in bins){  a <-
print(mean(unlist(lapply(glmm_zone_data$prop_time, function(x){
length(which(x>i & x <= i+0.01))/length(x)}), use.names = FALSE)))
real_seq <- c(real_seq,a)}*


*barplot(rbind(sim_seq,real_seq),col=c("green","red"),beside = TRUE,
legend.text = c("simulated","real"))*


*cbind(bins,sim_seq,real_seq)*











*bins         sim_seq    real_seq  [1,] 0.00 0.2199795081967 0.232240437
[2,] 0.01 0.1021612021858 0.166666667  [3,] 0.02 0.0750964480874
0.103825137  [4,] 0.03 0.0592128415301 0.073770492  [5,] 0.04
0.0478316939891 0.038251366  [6,] 0.05 0.0397267759563 0.030054645  [7,]
0.06 0.0331961748634 0.016393443  [8,] 0.07 0.0279322404372 0.013661202
[9,] 0.08 0.0238013661202 0.024590164 [10,] 0.09 0.0201101092896
0.010928962*

As you can see in the output above (and the plot attached), the real data
is quite a bit more right-skewed compared to the simulated values. Does
this look like a good enough fit or will I have to try a different model?

Many thanks,
Mike

On Wed, 19 May 2021 at 12:38, Michael Lawson <mrml500 using york.ac.uk> wrote:

> Dear Alain,
>
> Thank you for the suggestion. I tried simulating the 0's as you suggested
> in the following way. The output is reassuring - with the actual number of
> zeros in the middle of the distribution.
>
> sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
> sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
> length(which(x==0))/length(x)}), use.names = FALSE)
> hist(sim_zeros, breaks = c(100))
> abline(v = plogis(-1.1804), col = "red")
>
> All the best,
> Mike
>
>
> On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via
> R-sig-mixed-models <r-sig-mixed-models using r-project.org> wrote:
>
>>
>> > Today's Topics:
>> >
>> >     1. Re:  Model specification/family for a continuous/proportional
>> >        response with many zeros (Michael Lawson)
>> >
>> > ----------------------------------------------------------------------
>> >
>> > Message: 1
>> > Date: Wed, 19 May 2021 09:31:46 +0100
>> > From: Michael Lawson <mrml500 using york.ac.uk>
>> > To: Thierry Onkelinx <thierry.onkelinx using inbo.be>
>> > Cc: r-sig-mixed-models <r-sig-mixed-models using r-project.org>
>> > Subject: Re: [R-sig-ME]  Model specification/family for a
>> >       continuous/proportional response with many zeros
>> > Message-ID:
>> >       <
>> CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw using mail.gmail.com>
>> > Content-Type: text/plain; charset="utf-8"
>> >
>> > Hi Thierry,
>> >
>> > I understood after your previous message that a high dispersion
>> parameter
>> > in beta models does not signify overdispersion. Both answers addressed
>> this
>> > misconception in those two links I provided.
>> >
>> > I suppose the real answer I am now looking for is -  how do I assess the
>> > validity and fit of the model? Are there specifics or assumptions I must
>> > take into account with zero-inflated beta glmms?
>>
>> Just simulate 1000 data sets from your model, and see whether the
>> simulated data sets are comparable to your original data. One of the
>> things you can look at is whether the simulated data sets contain
>> similar number of zeros as in the observed data. I'm not sure whether
>> the DHARMA package can do this for you. If not..it is easy to program.
>>
>> See also Figure 8 in (sorry for self-citing):
>>
>>
>> https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577
>>
>> Alain
>>
>>
>> > Thanks for your help,
>> > Mike
>> >
>> > On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <
>> thierry.onkelinx using inbo.be>
>> > wrote:
>> >
>> >> Dear Mike,
>> >>
>> >> I think you misread my reply. I never stated that there's something
>> wrong
>> >> with the model. The only "problem" I highlighted was your misconception
>> >> about the "high overdispersion". In this case, a high parameter value
>> >> indicates a low variance, which is what we mostly want to see.
>> >>
>> >> Best regards,
>> >>
>> >> ir. Thierry Onkelinx
>> >> Statisticus / Statistician
>> >>
>> >> Vlaamse Overheid / Government of Flanders
>> >> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
>> AND
>> >> FOREST
>> >> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>> >> thierry.onkelinx using inbo.be
>> >> Havenlaan 88 bus 73, 1000 Brussel
>> >> www.inbo.be
>> >>
>> >>
>> >>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> >> To call in the statistician after the experiment is done may be no more
>> >> than asking him to perform a post-mortem examination: he may be able
>> to say
>> >> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>> >> The plural of anecdote is not data. ~ Roger Brinner
>> >> The combination of some data and an aching desire for an answer does
>> not
>> >> ensure that a reasonable answer can be extracted from a given body of
>> data.
>> >> ~ John Tukey
>> >>
>> >>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> >>
>> >> <https://www.inbo.be>
>> >>
>> >>
>> >> Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 using york.ac.uk
>> >:
>> >>
>> >>> Dear Thierry,
>> >>>
>> >>> Thanks for the help. So if the dispersion parameter in this model
>> doesn't
>> >>> fit with the beta distribution, are there any alternative approaches
>> I can
>> >>> use?
>> >>>
>> >>> I can't seem to find much information on this elsewhere other than
>> these
>> >>> two threads:
>> >>> https://stats.stackexchange.com/a/451453/233414
>> >>> https://stats.stackexchange.com/a/466951/233414
>> >>>
>> >>> All the best,
>> >>> Mike
>> >>>
>> >>> On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <
>> thierry.onkelinx using inbo.be>
>> >>> wrote:
>> >>>
>> >>>> Dear Mike,
>> >>>>
>> >>>> The zero-inflation is specified on the logit scale. plogis(-1.18) =
>> >>>> 0.235 23.5% zero seems reasonable when reading your story. (Didn't
>> look at
>> >>>> the data).
>> >>>>
>> >>>> You need to look at the definition for the "over"dispersion
>> parameter.
>> >>>> For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) /
>> (\phi + 1)
>> >>>> (see ?glmmTMB::beta_family) Hence a large value of \phi implies a low
>> >>>> variance.
>> >>>>
>> >>>> Best regards,
>> >>>>
>> >>>> ir. Thierry Onkelinx
>> >>>> Statisticus / Statistician
>> >>>>
>> >>>> Vlaamse Overheid / Government of Flanders
>> >>>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
>> NATURE
>> >>>> AND FOREST
>> >>>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>> >>>> thierry.onkelinx using inbo.be
>> >>>> Havenlaan 88 bus 73, 1000 Brussel
>> >>>> www.inbo.be
>> >>>>
>> >>>>
>> >>>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> >>>> To call in the statistician after the experiment is done may be no
>> more
>> >>>> than asking him to perform a post-mortem examination: he may be able
>> to say
>> >>>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>> >>>> The plural of anecdote is not data. ~ Roger Brinner
>> >>>> The combination of some data and an aching desire for an answer does
>> not
>> >>>> ensure that a reasonable answer can be extracted from a given body
>> of data.
>> >>>> ~ John Tukey
>> >>>>
>> >>>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> >>>>
>> >>>> <https://www.inbo.be>
>> >>>>
>> >>>>
>> >>>> Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <
>> mrml500 using york.ac.uk>:
>> >>>>
>> >>>>> Hi Thierry,
>> >>>>>
>> >>>>> Thank you for your advice and speedy response.
>> >>>>>
>> >>>>> Most of the data is closer to the lower bound (0). e.g. the mean
>> time
>> >>>>> for group A in zone A = 15.1 seconds and group A in zone B = 3.8
>> seconds.
>> >>>>> However there are a very small number of outliers near the upper
>> bound, the
>> >>>>> largest being 294 out of the 300 seconds (see the attached file if
>> you want
>> >>>>> to look at the data).
>> >>>>>
>> >>>>> I have taken a stab at running a Zero-inflated Beta GLMM using
>> glmmTMB
>> >>>>> in R like so:
>> >>>>>
>> >>>>> betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
>> >>>>>                               family = beta_family(),
>> >>>>>                               ziformula=~1,
>> >>>>>                               data = glmm_zone_data)
>> >>>>>
>> >>>>> summary(beta_mod)
>> >>>>>
>> >>>>> *Family: beta  ( logit )*
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> *Formula:          prop_time ~ group * zone + (1 | id)Zero
>> inflation:
>> >>>>>            ~1Data: glmm_zone_data     AIC      BIC   logLik deviance
>> >>>>> df.resid  -763.6   -736.3    388.8   -777.6      359Random
>> >>>>> effects:Conditional model: Groups Name        Variance  Std.Dev. id
>> >>>>> (Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
>> >>>>> 14Overdispersion parameter for beta family (): 13.1Conditional
>> model:
>> >>>>>              Estimate Std. Error z value Pr(>|z|)    (Intercept)
>> >>>>>   -2.7685     0.1031 -26.844  < 2e-16 ***groupB             -0.4455
>> >>>>> 0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524
>> -2.741
>> >>>>> 0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
>> >>>>> ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
>> >>>>> 1Zero-inflation model:            Estimate Std. Error z value
>> Pr(>|z|)
>> >>>>>    (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif.
>> codes:  0
>> >>>>> ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1*
>> >>>>>
>> >>>>> Does this look like the correct way of specifying the model? I am a
>> >>>>> little confused about specifying and interpreting the zero-inflation
>> >>>>> component - I have only just begun reading about this.
>> >>>>>
>> >>>>> I noticed that the dispersion parameter is quite high at 13.1. I'm
>> not
>> >>>>> sure if this matters for beta models?. I tried running DHARMa
>> >>>>> simulateResiduals on the model output and got significant
>> deviations in the
>> >>>>> dispersion (<2.2e-16) and KS tests. e.g.
>> >>>>>
>> >>>>> DHARMa::testDispersion(beta_mod)
>> >>>>>
>> >>>>> *DHARMa nonparametric dispersion test via sd of residuals fitted vs.
>> >>>>> simulated*
>> >>>>>
>> >>>>> *data:  simulationOutput*
>> >>>>> *ratioObsSim = 1.3612, p-value < 2.2e-16*
>> >>>>> *alternative hypothesis: two.sided*
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> Many thanks,
>> >>>>> Mike
>> >>>>>
>> >>>>> On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
>> >>>>> thierry.onkelinx using inbo.be> wrote:
>> >>>>>
>> >>>>>> Dear Michael,
>> >>>>>>
>> >>>>>> Your data has bounds (lower bound at 0 and upper bound at 300) and
>> you
>> >>>>>> have a lot of data close to a boundary. In such a case, a
>> continuous
>> >>>>>> distribution which ignores those bound is not a good idea. If the
>> time
>> >>>>>> spent outside of both zones is limited, then a long time in zone A
>> excludes
>> >>>>>> a long time in zone B by definition. Then I'd look towards a
>> multinomial
>> >>>>>> distribution. If the time spent outside both zones is dominant,
>> then you
>> >>>>>> can use a zero-inflated beta as you suggested. A zero-inflated
>> gamma might
>> >>>>>> be OK if the data is not too close to the upper boundary. If you
>> are
>> >>>>>> considering zero-inflated beta vs zero-inflated gamma, then you
>> should
>> >>>>>> choose zero-inflated beta IMHO.
>> >>>>>>
>> >>>>>> Best regards,
>> >>>>>>
>> >>>>>> ir. Thierry Onkelinx
>> >>>>>> Statisticus / Statistician
>> >>>>>>
>> >>>>>> Vlaamse Overheid / Government of Flanders
>> >>>>>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
>> NATURE
>> >>>>>> AND FOREST
>> >>>>>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality
>> Assurance
>> >>>>>> thierry.onkelinx using inbo.be
>> >>>>>> Havenlaan 88 bus 73, 1000 Brussel
>> >>>>>> www.inbo.be
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> >>>>>> To call in the statistician after the experiment is done may be no
>> >>>>>> more than asking him to perform a post-mortem examination: he may
>> be able
>> >>>>>> to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
>> >>>>>> The plural of anecdote is not data. ~ Roger Brinner
>> >>>>>> The combination of some data and an aching desire for an answer
>> does
>> >>>>>> not ensure that a reasonable answer can be extracted from a given
>> body of
>> >>>>>> data. ~ John Tukey
>> >>>>>>
>> >>>>>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> >>>>>>
>> >>>>>> <https://www.inbo.be>
>> >>>>>>
>> >>>>>>
>> >>>>>> Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
>> >>>>>> R-sig-mixed-models <r-sig-mixed-models using r-project.org>:
>> >>>>>>
>> >>>>>>> Hello,
>> >>>>>>>
>> >>>>>>> I am new to GLMMs and have a dataset where I have two distinct
>> groups
>> >>>>>>> (A
>> >>>>>>> and B) of 7 individuals each. The data consists of repeated
>> >>>>>>> measurements of
>> >>>>>>> each individual where the amount of time spent at either zone_A or
>> >>>>>>> zone_B
>> >>>>>>> is recorded (out of a total time of 300s/observation period). For
>> >>>>>>> most of
>> >>>>>>> the time period the individuals are in neither zone.
>> >>>>>>>
>> >>>>>>> I want to test if group A and group B spend more time in zone A
>> >>>>>>> compared to
>> >>>>>>> zone B (and vice versa).
>> >>>>>>>
>> >>>>>>> Speaking to someone else, they said I should use a Binomial GLMM
>> using
>> >>>>>>> cbind. i.e.
>> >>>>>>> cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).
>> >>>>>>>
>> >>>>>>> However, the response variable is continuous (albeit with an upper
>> >>>>>>> bound of
>> >>>>>>> 300 seconds per observation period), so I'm not sure if this is
>> >>>>>>> appropriate?
>> >>>>>>>
>> >>>>>>> Should I convert the response into a proportion and use something
>> >>>>>>> like a
>> >>>>>>> Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something
>> like:
>> >>>>>>> prop_time ~ zone*group + (1|id)
>> >>>>>>>
>> >>>>>>> The data is quite heavily right-skewed and contains a lot of 0's,
>> so
>> >>>>>>> reading around it also looks like I may need to convert these
>> into a
>> >>>>>>> zero-inflated/hurdle model?
>> >>>>>>>
>> >>>>>>> Thank you for any suggestions,
>> >>>>>>> Mike
>> >>>>>>>
>> >>>>>>>          [[alternative HTML version deleted]]
>> >>>>>>>
>> >>>>>>> _______________________________________________
>> >>>>>>> R-sig-mixed-models using r-project.org mailing list
>> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>> >>>>>>>
>> >       [[alternative HTML version deleted]]
>> >
>> >
>> >
>> >
>> > ------------------------------
>> >
>> > Subject: Digest Footer
>> >
>> > _______________________________________________
>> > R-sig-mixed-models mailing list
>> > R-sig-mixed-models using r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>> >
>> >
>> > ------------------------------
>> >
>> > End of R-sig-mixed-models Digest, Vol 173, Issue 19
>> > ***************************************************
>>
>> --
>> Dr. Alain F. Zuur
>> Highland Statistics Ltd.
>> 9 St Clair Wynd
>> AB41 6DZ Newburgh, UK
>> Email: highstat using highstat.com
>> URL:   www.highstat.com
>>
>> _______________________________________________
>> R-sig-mixed-models using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sim_vs_real_zero_beta_glmm.png
Type: image/png
Size: 3157 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20210520/0645037f/attachment.png>


More information about the R-sig-mixed-models mailing list