[R-sig-ME] Model specification/family for a continuous/proportional, response with many zeros

Wed May 19 12:18:32 CEST 2021

> Today's Topics:
>
>     1. Re:  Model specification/family for a continuous/proportional
>        response with many zeros (Michael Lawson)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 19 May 2021 09:31:46 +0100
> From: Michael Lawson <mrml500 using york.ac.uk>
> To: Thierry Onkelinx <thierry.onkelinx using inbo.be>
> Cc: r-sig-mixed-models <r-sig-mixed-models using r-project.org>
> Subject: Re: [R-sig-ME]  Model specification/family for a
> 	continuous/proportional response with many zeros
> Message-ID:
> 	<CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw using mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Thierry,
>
> I understood after your previous message that a high dispersion parameter
> in beta models does not signify overdispersion. Both answers addressed this
> misconception in those two links I provided.
>
> I suppose the real answer I am now looking for is -  how do I assess the
> validity and fit of the model? Are there specifics or assumptions I must
> take into account with zero-inflated beta glmms?

Just simulate 1000 data sets from your model, and see whether the 
simulated data sets are comparable to your original data. One of the 
things you can look at is whether the simulated data sets contain 
similar number of zeros as in the observed data. I'm not sure whether 
the DHARMA package can do this for you. If not..it is easy to program.

See also Figure 8 in (sorry for self-citing):

https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577

Alain

> Thanks for your help,
> Mike
>
> On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <thierry.onkelinx using inbo.be>
> wrote:
>
>> Dear Mike,
>>
>> I think you misread my reply. I never stated that there's something wrong
>> with the model. The only "problem" I highlighted was your misconception
>> about the "high overdispersion". In this case, a high parameter value
>> indicates a low variance, which is what we mostly want to see.
>>
>> Best regards,
>>
>> ir. Thierry Onkelinx
>> Statisticus / Statistician
>>
>> Vlaamse Overheid / Government of Flanders
>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
>> FOREST
>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>> thierry.onkelinx using inbo.be
>> Havenlaan 88 bus 73, 1000 Brussel
>> www.inbo.be
>>
>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> To call in the statistician after the experiment is done may be no more
>> than asking him to perform a post-mortem examination: he may be able to say
>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>> The plural of anecdote is not data. ~ Roger Brinner
>> The combination of some data and an aching desire for an answer does not
>> ensure that a reasonable answer can be extracted from a given body of data.
>> ~ John Tukey
>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>>
>> <https://www.inbo.be>
>>
>>
>> Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 using york.ac.uk>:
>>
>>> Dear Thierry,
>>>
>>> Thanks for the help. So if the dispersion parameter in this model doesn't
>>> fit with the beta distribution, are there any alternative approaches I can
>>> use?
>>>
>>> I can't seem to find much information on this elsewhere other than these
>>> two threads:
>>> https://stats.stackexchange.com/a/451453/233414
>>> https://stats.stackexchange.com/a/466951/233414
>>>
>>> All the best,
>>> Mike
>>>
>>> On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <thierry.onkelinx using inbo.be>
>>> wrote:
>>>
>>>> Dear Mike,
>>>>
>>>> The zero-inflation is specified on the logit scale. plogis(-1.18) =
>>>> 0.235 23.5% zero seems reasonable when reading your story. (Didn't look at
>>>> the data).
>>>>
>>>> You need to look at the definition for the "over"dispersion parameter.
>>>> For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) / (\phi + 1)
>>>> (see ?glmmTMB::beta_family) Hence a large value of \phi implies a low
>>>> variance.
>>>>
>>>> Best regards,
>>>>
>>>> ir. Thierry Onkelinx
>>>> Statisticus / Statistician
>>>>
>>>> Vlaamse Overheid / Government of Flanders
>>>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
>>>> AND FOREST
>>>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>>>> thierry.onkelinx using inbo.be
>>>> Havenlaan 88 bus 73, 1000 Brussel
>>>> www.inbo.be
>>>>
>>>>
>>>> ///////////////////////////////////////////////////////////////////////////////////////////
>>>> To call in the statistician after the experiment is done may be no more
>>>> than asking him to perform a post-mortem examination: he may be able to say
>>>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>>>> The plural of anecdote is not data. ~ Roger Brinner
>>>> The combination of some data and an aching desire for an answer does not
>>>> ensure that a reasonable answer can be extracted from a given body of data.
>>>> ~ John Tukey
>>>>
>>>> ///////////////////////////////////////////////////////////////////////////////////////////
>>>>
>>>> <https://www.inbo.be>
>>>>
>>>>
>>>> Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <mrml500 using york.ac.uk>:
>>>>
>>>>> Hi Thierry,
>>>>>
>>>>> Thank you for your advice and speedy response.
>>>>>
>>>>> Most of the data is closer to the lower bound (0). e.g. the mean time
>>>>> for group A in zone A = 15.1 seconds and group A in zone B = 3.8 seconds.
>>>>> However there are a very small number of outliers near the upper bound, the
>>>>> largest being 294 out of the 300 seconds (see the attached file if you want
>>>>> to look at the data).
>>>>>
>>>>> I have taken a stab at running a Zero-inflated Beta GLMM using glmmTMB
>>>>> in R like so:
>>>>>
>>>>> betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
>>>>>                               family = beta_family(),
>>>>>                               ziformula=~1,
>>>>>                               data = glmm_zone_data)
>>>>>
>>>>> summary(beta_mod)
>>>>>
>>>>> *Family: beta  ( logit )*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *Formula:          prop_time ~ group * zone + (1 | id)Zero inflation:
>>>>>            ~1Data: glmm_zone_data     AIC      BIC   logLik deviance
>>>>> df.resid  -763.6   -736.3    388.8   -777.6      359Random
>>>>> effects:Conditional model: Groups Name        Variance  Std.Dev. id
>>>>> (Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
>>>>> 14Overdispersion parameter for beta family (): 13.1Conditional model:
>>>>>              Estimate Std. Error z value Pr(>|z|)    (Intercept)
>>>>>   -2.7685     0.1031 -26.844  < 2e-16 ***groupB             -0.4455
>>>>> 0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524  -2.741
>>>>> 0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
>>>>> ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
>>>>> 1Zero-inflation model:            Estimate Std. Error z value Pr(>|z|)
>>>>>    (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif. codes:  0
>>>>> ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1*
>>>>>
>>>>> Does this look like the correct way of specifying the model? I am a
>>>>> little confused about specifying and interpreting the zero-inflation
>>>>> component - I have only just begun reading about this.
>>>>>
>>>>> I noticed that the dispersion parameter is quite high at 13.1. I'm not
>>>>> sure if this matters for beta models?. I tried running DHARMa
>>>>> simulateResiduals on the model output and got significant deviations in the
>>>>> dispersion (<2.2e-16) and KS tests. e.g.
>>>>>
>>>>> DHARMa::testDispersion(beta_mod)
>>>>>
>>>>> *DHARMa nonparametric dispersion test via sd of residuals fitted vs.
>>>>> simulated*
>>>>>
>>>>> *data:  simulationOutput*
>>>>> *ratioObsSim = 1.3612, p-value < 2.2e-16*
>>>>> *alternative hypothesis: two.sided*
>>>>>
>>>>>
>>>>>
>>>>> Many thanks,
>>>>> Mike
>>>>>
>>>>> On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
>>>>> thierry.onkelinx using inbo.be> wrote:
>>>>>
>>>>>> Dear Michael,
>>>>>>
>>>>>> Your data has bounds (lower bound at 0 and upper bound at 300) and you
>>>>>> have a lot of data close to a boundary. In such a case, a continuous
>>>>>> distribution which ignores those bound is not a good idea. If the time
>>>>>> spent outside of both zones is limited, then a long time in zone A excludes
>>>>>> a long time in zone B by definition. Then I'd look towards a multinomial
>>>>>> distribution. If the time spent outside both zones is dominant, then you
>>>>>> can use a zero-inflated beta as you suggested. A zero-inflated gamma might
>>>>>> be OK if the data is not too close to the upper boundary. If you are
>>>>>> considering zero-inflated beta vs zero-inflated gamma, then you should
>>>>>> choose zero-inflated beta IMHO.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> ir. Thierry Onkelinx
>>>>>> Statisticus / Statistician
>>>>>>
>>>>>> Vlaamse Overheid / Government of Flanders
>>>>>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
>>>>>> AND FOREST
>>>>>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>>>>>> thierry.onkelinx using inbo.be
>>>>>> Havenlaan 88 bus 73, 1000 Brussel
>>>>>> www.inbo.be
>>>>>>
>>>>>>
>>>>>> ///////////////////////////////////////////////////////////////////////////////////////////
>>>>>> To call in the statistician after the experiment is done may be no
>>>>>> more than asking him to perform a post-mortem examination: he may be able
>>>>>> to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
>>>>>> The plural of anecdote is not data. ~ Roger Brinner
>>>>>> The combination of some data and an aching desire for an answer does
>>>>>> not ensure that a reasonable answer can be extracted from a given body of
>>>>>> data. ~ John Tukey
>>>>>>
>>>>>> ///////////////////////////////////////////////////////////////////////////////////////////
>>>>>>
>>>>>> <https://www.inbo.be>
>>>>>>
>>>>>>
>>>>>> Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
>>>>>> R-sig-mixed-models <r-sig-mixed-models using r-project.org>:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am new to GLMMs and have a dataset where I have two distinct groups
>>>>>>> (A
>>>>>>> and B) of 7 individuals each. The data consists of repeated
>>>>>>> measurements of
>>>>>>> each individual where the amount of time spent at either zone_A or
>>>>>>> zone_B
>>>>>>> is recorded (out of a total time of 300s/observation period). For
>>>>>>> most of
>>>>>>> the time period the individuals are in neither zone.
>>>>>>>
>>>>>>> I want to test if group A and group B spend more time in zone A
>>>>>>> compared to
>>>>>>> zone B (and vice versa).
>>>>>>>
>>>>>>> Speaking to someone else, they said I should use a Binomial GLMM using
>>>>>>> cbind. i.e.
>>>>>>> cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).
>>>>>>>
>>>>>>> However, the response variable is continuous (albeit with an upper
>>>>>>> bound of
>>>>>>> 300 seconds per observation period), so I'm not sure if this is
>>>>>>> appropriate?
>>>>>>>
>>>>>>> Should I convert the response into a proportion and use something
>>>>>>> like a
>>>>>>> Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something like:
>>>>>>> prop_time ~ zone*group + (1|id)
>>>>>>>
>>>>>>> The data is quite heavily right-skewed and contains a lot of 0's, so
>>>>>>> reading around it also looks like I may need to convert these into a
>>>>>>> zero-inflated/hurdle model?
>>>>>>>
>>>>>>> Thank you for any suggestions,
>>>>>>> Mike
>>>>>>>
>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> R-sig-mixed-models using r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>>>>>
> 	[[alternative HTML version deleted]]
>
>
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> R-sig-mixed-models mailing list
> R-sig-mixed-models using r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
>
> ------------------------------
>
> End of R-sig-mixed-models Digest, Vol 173, Issue 19
> ***************************************************

-- 
Dr. Alain F. Zuur
Highland Statistics Ltd.
9 St Clair Wynd
AB41 6DZ Newburgh, UK
Email: highstat using highstat.com
URL:   www.highstat.com