[R-sig-ME] zero-inflation and multimodal count distribution
Ben Bolker
bbolker at gmail.com
Wed May 3 20:06:43 CEST 2017
[Please keep r-sig-mixed-models at r-project.org in the Cc: ...]
On Wed, May 3, 2017 at 1:43 PM, simone santoro
<simonesantoro77 at gmail.com> wrote:
> Thank you for observing that, it makes much sense!
> I have being exploring better the data and have realized that the response
> variable varies greatly among years in a way that might explain why the
> multimodal marginal distribution. I have also tried different combinations
> of random intercepts/slopes and have found that the best fit is achieved by
> defining nest identity and year as (independent) random intercepts
> {(1|NF)+(1|YEAR)}. Now, I would refine my modelling approach and have tried
> to model these data in a different way.
>
> As told in the previous mail, actually, I am interested in testing whether
> the ratio of sons relative to daughters contribution in terms of number of
> descendants (nDesc) depends on the non-additive effect of parental quality
> (parQuality). Then, I have tried a zero-inflated binomial and betabinomial
> family. In these cases I have considered as response variable the number of
> descendants generated by each individual (son or daughter), i.e. nDesc,
> relative to the sum of descendants (sumDesc) generated by the whole nest
> (sons + daughters). My response variable is therefore: nDesc/sumDesc
>
> I would think that a betabinomial distribution might be better because I
> know that the response variable has lot of zeros and lot of 1. By the way:
> do exist zero-one-inflated beta (mixed) models as that discussed here?
>
> Thus I try the betabinomial first:
> tmbBetaBinomial<- glmmTMB(nDesc/sumDesc ~ parQuality*sex+(1|NF)+(1|YEAR),…,
> zi~1,family= list(family="betabinomial",link="logit"), weights= sumDesc)
> and I get this warning message:
>
> Warning message:
> In nlminb(start = par, objective = fn, gradient = gr) :
> NA/NaN function evaluation
This is probably a harmless error message. It just means that the
optimizer wandered into illegal territory somewhere along the way.
>
>
> Then I try the binomial:
> tmbBinomial<- glmmTMB(nDesc/sumDesc ~ parQuality*sex+(1|NF)+(1|YEAR),…,
> zi~1,family= binomial, weights= sumDesc)
> I get no warning messages.
>
> This is the summary of the betabinomial glmm:
>
>>summary(tmbBetaBinomial)
> Family: betabinomial ( logit )
> Formula: nDesc/sumDesc ~ parQuality * sex + (1 | NF) + (1 | YEAR)
> Zero inflation: ~1
> Data: ind
> Weights: sumDesc
>
> AIC BIC logLik deviance df.resid
> 5939.0 5983.4 -2961.5 5923.0 1886
>
> Random effects:
>
> Conditional model:
> Groups Name Variance Std.Dev.
> NF (Intercept) 4.130e-09 6.426e-05
> YEAR (Intercept) 1.328e+00 1.152e+00
> Number of obs: 1894, groups: NF, 719; YEAR, 24
>
> Overdispersion parameter for betabinomial family (): 0.38
>
> Conditional model:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 0.099863 0.355641 0.281 0.779
> parQuality 0.001557 0.004606 0.338 0.735
> sexmal 0.002171 0.350213 0.006 0.995
> parQuality:sexmal 0.000784 0.006104 0.128 0.898
>
> Zero-inflation model:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -27.8 64568.8 0 1
>
>
> There is a strange z value for the zero-inflation model, what could it mean?
It means that the model tried to reduce the zero-inflation component
to zero. It fits the zero-inflation probability on a logit scale, so
it can't make it *exactly* zero, but it can make it very very close.
Note that the NF variance is also effectively zero (std dev 10^5 times
smaller than the YEAR variance and about 10 times smaller than the
smallest-magnitude fixed effect). The AIC/log-likelihood for this
model are much better than for the binomial GLMM, so (alas) I would
advise you to take this model instead.
In any case I think you should try plotting (or otherwise
inspecting) the predictions from both models, to understand what
they're saying.
>
> This is the summary of the binomial glmm:
>
> Family: binomial ( logit )
> Formula: nDesc/sumDesc ~ parQuality * sex + (1 | NF) + (1 | YEAR)
> Zero inflation: ~1
> Data: ind
> Weights: sumDesc
>
> AIC BIC logLik deviance df.resid
> 6841.9 6880.8 -3414.0 6827.9 1887
>
> Random effects:
>
> Conditional model:
> Groups Name Variance Std.Dev.
> NF (Intercept) 3.646e+01 6.0381644
> YEAR (Intercept) 1.090e-08 0.0001044
> Number of obs: 1894, groups: NF, 719; YEAR, 24
>
> Conditional model:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 7.386051 0.877055 8.421 < 2e-16 ***
> parQuality -0.007452 0.013641 -0.546 0.58483
> sexmal -0.649383 0.235775 -2.754 0.00588 **
> parQuality:sexmal 0.009153 0.004007 2.284 0.02235 *
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Zero-inflation model:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -0.03268 0.04625 -0.706 0.48
>
> This would make biological sense to me because I would expect that sons
> ("sexmal") are advantaged relative to their sisters when the parental
> quality is high. However, may I trust this result? is there something that
> you see I am missing?
> Best,
>
> Simone
>
>
>
>
>
> 2017-05-02 18:31 GMT+02:00 Ben Bolker <bbolker at gmail.com>:
>>
>>
>>
>> On 17-05-02 12:20 PM, simone santoro wrote:
>> > Hi all,
>> >
>> > I am trying to test a hypothesis regarding the different contribution of
>> > sons and daughters to parents’ fitness. I have a number of (bird) nests
>> > of
>> > which I have measured a feature of parents related to their quality
>> > (continuous variable) that I hypothesize affects the future lifetime
>> > fecundity of their sons and daughters.
>> >
>> > Specifically, my hypothesis is that at high values of parents’ quality
>> > sons
>> > will be more fecund than sisters through their entire life and vice
>> > versa,
>> > at low values of parents’ quality, daughters will be more than brothers.
>> >
>> >
>> > Note that sons and daughters of a nest, of which I have recorded their
>> > lifetime fecundity, are born all the same year. Thus, year of birth (of
>> > sons and daughters) is a random intercept I want to control for as it is
>> > the nest identity. The data set may be arranged in two ways, one that
>> > considers a row for each nest and another that considers a row for each
>> > offspring (son or daughter).
>> >
>> >
>> > In case 1 (row = nest), I have these variables: FN, family name; YEAR,
>> > birth year of sons and daughter; nDescBySons, lifetime total number of
>> > progeny generated by sons (pooled); nDescByDaughs, lifetime total
>> > number
>> > of progeny generated by daughters (pooled); nSons, number of sons;
>> > nDaughs,
>> > number of daughters; parQuality, parents’ quality.
>> >
>> > In case 2 (row = son or daughter), I have these variables: FN, family
>> > name;
>> > YEAR, birth year of sons and daughter; nDesc, lifetime total number of
>> > progeny generated by the individual; sex, son or daughter; nestSize,
>> > total
>> > number of sons and daughters at nest; parQuality, parents’ quality.
>> >
>> >
>> > In a way, I think that the second arrangement of data is easier to be
>> > analyzed for testing my hypothesis (comment/suggestion on this?). In
>> > this
>> > way I have direct information on the individual-level lifetime fecundity
>> > of
>> > sons and daughters and have not necessarily to take care of how many
>> > sons
>> > and daughters were at the nest.
>> > However, I have lot of zeros (many sons and daughters disappear – die or
>> > emigrate - and have no recorded descendants at all) and data have a kind
>> > of
>> > bimodal distribution after the zero mode (see below image):
>> >
>> > https://drive.google.com/open?id=0BwsTfIcebsrOZnljSW9uQXF2UU0
>> >
>> >
>> > Thus, I would use a zero-inflated GLMM as, for instance, by using
>> > glmmTMB
>> > package in R. Something like this:
>> >
>> > glmmTMB(nDesc ~ parQuality*sex+(1|NF)+(1|YEAR),…, zi~1)
>> >
>> > But, what about that ‘ugly’ multimodal distribution? I thought I may try
>> > different distributions (e.g. poisson, compois, any other?) and compare
>> > the
>> > model fit by looking at the AIC.
>> >
>> > Any advice on this would be extremely appreciated.
>> >
>> >
>> > Simone
>> >
>>
>> My main thought is that your plots show the *marginal* distribution
>> of the data. Differences among families/years or odd shapes of the
>> parental quality distribution could drive this pattern without any need
>> to assume the *conditional* distribution is multimodal. Fit a sensible
>> model (like the one you suggest) and then check diagnostics in various
>> ways (if you have enough data, you could consider interactions between
>> sex and parental quality and the random effects -- e.g. does parental
>> quality matter more in some birth years than others?)
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
>
More information about the R-sig-mixed-models
mailing list