[R-sig-ME] Model specification/family for a continuous/proportional response with many zeros
Thierry Onkelinx
th|erry@onke||nx @end|ng |rom |nbo@be
Tue May 18 09:12:38 CEST 2021
Dear Mike,
The zero-inflation is specified on the logit scale. plogis(-1.18) = 0.235
23.5% zero seems reasonable when reading your story. (Didn't look at the
data).
You need to look at the definition for the "over"dispersion parameter. For
a beta distribution is \phi with Var(y) = \mu * (1 - \mu) / (\phi + 1) (see
?glmmTMB::beta_family) Hence a large value of \phi implies a low variance.
Best regards,
ir. Thierry Onkelinx
Statisticus / Statistician
Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx using inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////
<https://www.inbo.be>
Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <mrml500 using york.ac.uk>:
> Hi Thierry,
>
> Thank you for your advice and speedy response.
>
> Most of the data is closer to the lower bound (0). e.g. the mean time for
> group A in zone A = 15.1 seconds and group A in zone B = 3.8 seconds.
> However there are a very small number of outliers near the upper bound, the
> largest being 294 out of the 300 seconds (see the attached file if you want
> to look at the data).
>
> I have taken a stab at running a Zero-inflated Beta GLMM using glmmTMB in
> R like so:
>
> betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
> family = beta_family(),
> ziformula=~1,
> data = glmm_zone_data)
>
> summary(beta_mod)
>
> *Family: beta ( logit )*
>
>
>
>
>
>
> *Formula: prop_time ~ group * zone + (1 | id)Zero inflation:
> ~1Data: glmm_zone_data AIC BIC logLik deviance df.resid
> -763.6 -736.3 388.8 -777.6 359Random effects:Conditional
> model: Groups Name Variance Std.Dev. id (Intercept) 2.386e-09
> 4.885e-05Number of obs: 366, groups: id, 14Overdispersion parameter for
> beta family (): 13.1Conditional model: Estimate Std. Error
> z value Pr(>|z|) (Intercept) -2.7685 0.1031 -26.844 < 2e-16
> ***groupB -0.4455 0.1498 -2.975 0.002932 **zonezone_B
> -0.4179 0.1524 -2.741 0.006124 **groupB:zonezone_B 0.8443
> 0.2190 3.855 0.000116 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
> 0.05 ‘.’ 0.1 ‘ ’ 1Zero-inflation model: Estimate Std. Error z
> value Pr(>|z|) (Intercept) -1.1804 0.1233 -9.575 <2e-16
> ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1*
>
> Does this look like the correct way of specifying the model? I am a little
> confused about specifying and interpreting the zero-inflation component - I
> have only just begun reading about this.
>
> I noticed that the dispersion parameter is quite high at 13.1. I'm not
> sure if this matters for beta models?. I tried running DHARMa
> simulateResiduals on the model output and got significant deviations in the
> dispersion (<2.2e-16) and KS tests. e.g.
>
> DHARMa::testDispersion(beta_mod)
>
> *DHARMa nonparametric dispersion test via sd of residuals fitted vs.
> simulated*
>
> *data: simulationOutput*
> *ratioObsSim = 1.3612, p-value < 2.2e-16*
> *alternative hypothesis: two.sided*
>
>
>
> Many thanks,
> Mike
>
> On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <thierry.onkelinx using inbo.be>
> wrote:
>
>> Dear Michael,
>>
>> Your data has bounds (lower bound at 0 and upper bound at 300) and you
>> have a lot of data close to a boundary. In such a case, a continuous
>> distribution which ignores those bound is not a good idea. If the time
>> spent outside of both zones is limited, then a long time in zone A excludes
>> a long time in zone B by definition. Then I'd look towards a multinomial
>> distribution. If the time spent outside both zones is dominant, then you
>> can use a zero-inflated beta as you suggested. A zero-inflated gamma might
>> be OK if the data is not too close to the upper boundary. If you are
>> considering zero-inflated beta vs zero-inflated gamma, then you should
>> choose zero-inflated beta IMHO.
>>
>> Best regards,
>>
>> ir. Thierry Onkelinx
>> Statisticus / Statistician
>>
>> Vlaamse Overheid / Government of Flanders
>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
>> AND FOREST
>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>> thierry.onkelinx using inbo.be
>> Havenlaan 88 bus 73, 1000 Brussel
>> www.inbo.be
>>
>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> To call in the statistician after the experiment is done may be no more
>> than asking him to perform a post-mortem examination: he may be able to say
>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>> The plural of anecdote is not data. ~ Roger Brinner
>> The combination of some data and an aching desire for an answer does not
>> ensure that a reasonable answer can be extracted from a given body of data.
>> ~ John Tukey
>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>>
>> <https://www.inbo.be>
>>
>>
>> Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via R-sig-mixed-models <
>> r-sig-mixed-models using r-project.org>:
>>
>>> Hello,
>>>
>>> I am new to GLMMs and have a dataset where I have two distinct groups (A
>>> and B) of 7 individuals each. The data consists of repeated measurements
>>> of
>>> each individual where the amount of time spent at either zone_A or zone_B
>>> is recorded (out of a total time of 300s/observation period). For most of
>>> the time period the individuals are in neither zone.
>>>
>>> I want to test if group A and group B spend more time in zone A compared
>>> to
>>> zone B (and vice versa).
>>>
>>> Speaking to someone else, they said I should use a Binomial GLMM using
>>> cbind. i.e.
>>> cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).
>>>
>>> However, the response variable is continuous (albeit with an upper bound
>>> of
>>> 300 seconds per observation period), so I'm not sure if this is
>>> appropriate?
>>>
>>> Should I convert the response into a proportion and use something like a
>>> Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something like:
>>> prop_time ~ zone*group + (1|id)
>>>
>>> The data is quite heavily right-skewed and contains a lot of 0's, so
>>> reading around it also looks like I may need to convert these into a
>>> zero-inflated/hurdle model?
>>>
>>> Thank you for any suggestions,
>>> Mike
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> R-sig-mixed-models using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>
>>
[[alternative HTML version deleted]]
More information about the R-sig-mixed-models
mailing list