[R-sig-ME] Model specification/family for a continuous/proportional response with many zeros

Tue May 18 14:56:49 CEST 2021

Dear Mike,

I think you misread my reply. I never stated that there's something wrong
with the model. The only "problem" I highlighted was your misconception
about the "high overdispersion". In this case, a high parameter value
indicates a low variance, which is what we mostly want to see.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx using inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>

Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 using york.ac.uk>:

> Dear Thierry,
>
> Thanks for the help. So if the dispersion parameter in this model doesn't
> fit with the beta distribution, are there any alternative approaches I can
> use?
>
> I can't seem to find much information on this elsewhere other than these
> two threads:
> https://stats.stackexchange.com/a/451453/233414
> https://stats.stackexchange.com/a/466951/233414
>
> All the best,
> Mike
>
> On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <thierry.onkelinx using inbo.be>
> wrote:
>
>> Dear Mike,
>>
>> The zero-inflation is specified on the logit scale. plogis(-1.18) = 0.235
>> 23.5% zero seems reasonable when reading your story. (Didn't look at the
>> data).
>>
>> You need to look at the definition for the "over"dispersion parameter.
>> For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) / (\phi + 1)
>> (see ?glmmTMB::beta_family) Hence a large value of \phi implies a low
>> variance.
>>
>> Best regards,
>>
>> ir. Thierry Onkelinx
>> Statisticus / Statistician
>>
>> Vlaamse Overheid / Government of Flanders
>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
>> AND FOREST
>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>> thierry.onkelinx using inbo.be
>> Havenlaan 88 bus 73, 1000 Brussel
>> www.inbo.be
>>
>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> To call in the statistician after the experiment is done may be no more
>> than asking him to perform a post-mortem examination: he may be able to say
>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>> The plural of anecdote is not data. ~ Roger Brinner
>> The combination of some data and an aching desire for an answer does not
>> ensure that a reasonable answer can be extracted from a given body of data.
>> ~ John Tukey
>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>>
>> <https://www.inbo.be>
>>
>>
>> Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <mrml500 using york.ac.uk>:
>>
>>> Hi Thierry,
>>>
>>> Thank you for your advice and speedy response.
>>>
>>> Most of the data is closer to the lower bound (0). e.g. the mean time
>>> for group A in zone A = 15.1 seconds and group A in zone B = 3.8 seconds.
>>> However there are a very small number of outliers near the upper bound, the
>>> largest being 294 out of the 300 seconds (see the attached file if you want
>>> to look at the data).
>>>
>>> I have taken a stab at running a Zero-inflated Beta GLMM using glmmTMB
>>> in R like so:
>>>
>>> betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
>>>                              family = beta_family(),
>>>                              ziformula=~1,
>>>                              data = glmm_zone_data)
>>>
>>> summary(beta_mod)
>>>
>>> *Family: beta  ( logit )*
>>>
>>>
>>>
>>>
>>>
>>>
>>> *Formula:          prop_time ~ group * zone + (1 | id)Zero inflation:
>>>           ~1Data: glmm_zone_data     AIC      BIC   logLik deviance
>>> df.resid  -763.6   -736.3    388.8   -777.6      359Random
>>> effects:Conditional model: Groups Name        Variance  Std.Dev. id
>>> (Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
>>> 14Overdispersion parameter for beta family (): 13.1Conditional model:
>>>             Estimate Std. Error z value Pr(>|z|)    (Intercept)
>>>  -2.7685     0.1031 -26.844  < 2e-16 ***groupB             -0.4455
>>> 0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524  -2.741
>>> 0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
>>> ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
>>> 1Zero-inflation model:            Estimate Std. Error z value Pr(>|z|)
>>>   (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif. codes:  0
>>> ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1*
>>>
>>> Does this look like the correct way of specifying the model? I am a
>>> little confused about specifying and interpreting the zero-inflation
>>> component - I have only just begun reading about this.
>>>
>>> I noticed that the dispersion parameter is quite high at 13.1. I'm not
>>> sure if this matters for beta models?. I tried running DHARMa
>>> simulateResiduals on the model output and got significant deviations in the
>>> dispersion (<2.2e-16) and KS tests. e.g.
>>>
>>> DHARMa::testDispersion(beta_mod)
>>>
>>> *DHARMa nonparametric dispersion test via sd of residuals fitted vs.
>>> simulated*
>>>
>>> *data:  simulationOutput*
>>> *ratioObsSim = 1.3612, p-value < 2.2e-16*
>>> *alternative hypothesis: two.sided*
>>>
>>>
>>>
>>> Many thanks,
>>> Mike
>>>
>>> On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <thierry.onkelinx using inbo.be>
>>> wrote:
>>>
>>>> Dear Michael,
>>>>
>>>> Your data has bounds (lower bound at 0 and upper bound at 300) and you
>>>> have a lot of data close to a boundary. In such a case, a continuous
>>>> distribution which ignores those bound is not a good idea. If the time
>>>> spent outside of both zones is limited, then a long time in zone A excludes
>>>> a long time in zone B by definition. Then I'd look towards a multinomial
>>>> distribution. If the time spent outside both zones is dominant, then you
>>>> can use a zero-inflated beta as you suggested. A zero-inflated gamma might
>>>> be OK if the data is not too close to the upper boundary. If you are
>>>> considering zero-inflated beta vs zero-inflated gamma, then you should
>>>> choose zero-inflated beta IMHO.
>>>>
>>>> Best regards,
>>>>
>>>> ir. Thierry Onkelinx
>>>> Statisticus / Statistician
>>>>
>>>> Vlaamse Overheid / Government of Flanders
>>>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
>>>> AND FOREST
>>>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>>>> thierry.onkelinx using inbo.be
>>>> Havenlaan 88 bus 73, 1000 Brussel
>>>> www.inbo.be
>>>>
>>>>
>>>> ///////////////////////////////////////////////////////////////////////////////////////////
>>>> To call in the statistician after the experiment is done may be no more
>>>> than asking him to perform a post-mortem examination: he may be able to say
>>>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>>>> The plural of anecdote is not data. ~ Roger Brinner
>>>> The combination of some data and an aching desire for an answer does
>>>> not ensure that a reasonable answer can be extracted from a given body of
>>>> data. ~ John Tukey
>>>>
>>>> ///////////////////////////////////////////////////////////////////////////////////////////
>>>>
>>>> <https://www.inbo.be>
>>>>
>>>>
>>>> Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
>>>> R-sig-mixed-models <r-sig-mixed-models using r-project.org>:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am new to GLMMs and have a dataset where I have two distinct groups
>>>>> (A
>>>>> and B) of 7 individuals each. The data consists of repeated
>>>>> measurements of
>>>>> each individual where the amount of time spent at either zone_A or
>>>>> zone_B
>>>>> is recorded (out of a total time of 300s/observation period). For most
>>>>> of
>>>>> the time period the individuals are in neither zone.
>>>>>
>>>>> I want to test if group A and group B spend more time in zone A
>>>>> compared to
>>>>> zone B (and vice versa).
>>>>>
>>>>> Speaking to someone else, they said I should use a Binomial GLMM using
>>>>> cbind. i.e.
>>>>> cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).
>>>>>
>>>>> However, the response variable is continuous (albeit with an upper
>>>>> bound of
>>>>> 300 seconds per observation period), so I'm not sure if this is
>>>>> appropriate?
>>>>>
>>>>> Should I convert the response into a proportion and use something like
>>>>> a
>>>>> Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something like:
>>>>> prop_time ~ zone*group + (1|id)
>>>>>
>>>>> The data is quite heavily right-skewed and contains a lot of 0's, so
>>>>> reading around it also looks like I may need to convert these into a
>>>>> zero-inflated/hurdle model?
>>>>>
>>>>> Thank you for any suggestions,
>>>>> Mike
>>>>>
>>>>>         [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> R-sig-mixed-models using r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>>>
>>>>

	[[alternative HTML version deleted]]