[R-sig-ME] Predictions from zero-inflated or hurdle models

Jarrod Hadfield j.hadfield at ed.ac.uk
Wed Mar 18 07:06:43 CET 2015


Hi,

6/ you are correct - my mistake.


5b/ I'm still not sure what you mean by the uncertainty, but ignoring  
the random effects will result in under prediction. For a Poisson  
log-normal the mean  is  exp(a_1+v_1/2)  rather than  exp(a_1) for the  
standard Poisson.

Cheers,

Jarrod



Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015  
19:21:06 +0000:

> Hi,
>
> two more questions, sorry.
>
> 6.
> could it be that where you wrote
> normal.evd<-function(x, mu, v){
> exp(-exp(x))*dnorm(x, mu, sqrt(v))
> }
>
> you're not actually passing the variance but the standard deviation? so the
> functions should be
>
> normal.evd<-function(x, mu, sd){
> exp(-exp(x))*dnorm(x, mu, sd)
> }
> normal.zt<-function(x, mu, sd){
> exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sd)
> }
>
> Because at least in the example, you were taking the square root here
> already: sd_1<-mean(sqrt(m1$VCV[,1]))
> Maybe this was a copy-paste from the source where you do it differently? Or
> I misunderstood...
>
> An addition to question 5:
> 5. When I use the correct inverse link function in my old code, but don't
> do double integration, I do get a "reasonable" amount of uncertainty around
> the reference factor. The predicted means are too low, which I put down to
> this approach ignoring the VCV. I was just wondering how this could happen
> (actually seeing the "reasonable" amount of uncertainty was what made me
> hopeful that my approach might not be entirely wrong).
>
> Best,
>
> Ruben
>
>
> On Tue, Mar 17, 2015 at 7:53 PM Ruben Arslan <rubenarslan at gmail.com> wrote:
>
>> Hi Jarrod,
>>
>> thanks for the extensive reply! This helps a lot, though it sounds like I
>> was hubristic to attempt this myself.
>> I tried using the approach you mapped out in the function gist
>> <https://gist.github.com/rubenarslan/aeacdd306b3d061819a6> I posted. I
>> simply put the pred function in a loop, so that I wouldn't make any
>> mistakes while vectorising and since I don't care about performance at this
>> point.
>>
>> Of course, I have some follow up questions though.. I'm sorry if I'm being
>> a pain, I really appreciate the free advice, but understand of course if
>> other things take precedence.
>>
>> 1. You're not "down with" developing publicly are you? Because I sure
>> would like to test-drive the newdata prediction and simulation functions..
>>
>> 2. Could you make sure that I got this right: "When predictions are to be
>> taken after marginalising the random effects (including the `residual'
>> over-dispersion) it is not possible to obtain closed form expressions."
>> That is basically my scenario, right? In the example I included, I also
>> had a group-level random effect (family). Ore are you talking about the
>> "trait" as the random effect (as in your example) and my scenario is
>> different and I cannot apply the numerical double integration procedure you
>> posted?
>> To be clear about my prediction goal without using language that I might
>> be using incorrectly: I want to show what the average effect in the
>> response unit, number of children, is in my population(s). I have data on
>> whole populations and am using all of it (except individuals that don't
>> have completed fertility yet, because I have yet to find a way to model
>> both zero-inflation and right censoring).
>>
>> 3. "Numerical integration could be extended to double integration in
>> which case covariance between the Poisson part and the binary part could
>> be handled." That is what you posted an example of and it applies to my
>> scenario, because I specified a prior R=list(V=diag(2), nu=1.002, fix=2)
>> and rcov=~idh(trait):units, random=~idh(trait):idParents?
>> But this double integration approach is something you just wrote
>> off-the-cuff and I probably shouldn't use it in a publication? Or is this
>> in the forthcoming MCMCglmm release and I might actually be able to refer
>> to it once I get to submitting?
>>
>> 4. Could I change my model specification to forbid covariance between the
>> two parts and not shoot myself in the foot? Would this allow for a more
>> valid/tested approach to prediction?
>>
>> 5. When I use your method on my real data, I get less variation around the
>> prediction "for the reference level" than for all other factor levels.
>> My reference level actually has fewer cases than the others, so this
>> isn't "right" in a way.
>> Is this because I'm not doing newdata prediction? I get the "right"
>> looking uncertainty if I bootstrap newdata predictions in lme4,
>> Sorry if this is children's logic :-)
>> Here's an image of the prediction
>> <http://rpubs.com/rubenarslan/mcmcglmm_pred> and the raw data
>> <http://rpubs.com/rubenarslan/raw>.
>>
>> Many thanks for any answers that you feel inclined to give.
>>
>> Best wishes,
>>
>> Ruben
>>
>> On Tue, Mar 17, 2015 at 5:31 PM Jarrod Hadfield <j.hadfield at ed.ac.uk>
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry - I should have replied to this post earlier. I've been working
>>> on predict/simulate methods for all MCMcglmm models (including
>>> zero-inflated/altered/hurdle/truncated models) and these are now
>>> largely complete (together with newdata options).
>>>
>>> When predictions are to be taken after marginalising the random
>>> effects (including the `residual' over-dispersion) it is not possible
>>> to obtain closed form expressions. The options that will be available
>>> in MCMCglmm are:
>>>
>>> 1) algebraic approximation
>>> 2) numerical integration
>>> 3) simulation
>>>
>>> 1) and 2) are currently only accurate when the random/residual effect
>>> structure implies no covariance between the Poisson part and the
>>> binary part.
>>>
>>> 1) is reasonably accurate for zero-inflated distributions, but can be
>>> pretty poor for the remainder because they all involve zero-truncated
>>> Poisson log-normal distributions and my taylor approximation for the
>>> mean is less than ideal (any suggestions would be helpful).
>>>
>>> 2) could be extended to double integration in which case covariance
>>> between the Poisson part and the binary part could be handled.
>>>
>>> In your code, part of the problem is that you have fitted a zapoisson,
>>> but the prediction is based on a zipoisson (with complementary log-log
>>> link rather than logt link).
>>>
>>> In all zero-inflated/altered/hurdle/truncated models
>>>
>>> E[y] = E[(1-prob)*meanc]
>>>
>>> where prob is the probabilty of a zero in the binary part and meanc is
>>> the mean of a Poisson distribution (zipoisson) or a zero-truncated
>>> poisson (zapoisson and hupoisson). If we have eta_1 as the linear
>>> predictor for the poisson part and eta_2 as the linear predictor for
>>> the binary part:
>>>
>>> In zipoisson: prob = plogis(eta_2)     and meanc = exp(eta_1)
>>> In zapoisson: prob = exp(-exp(eta_2))  and meanc =
>>> exp(eta_1)/(1-exp(-exp(eta_1)))
>>> In hupoisson: prob = plogis(eta_2)     and meanc =
>>> exp(eta_1)/(1-exp(-exp(eta_1)))
>>> In ztpoisson: prob = 0                 and meanc =
>>> exp(eta_1)/(1-exp(-exp(eta_1)))
>>>
>>> In each case the linear predictor has a `fixed' part and a `random'
>>> part which I'll denote as `a' and `u' respectively. Ideally we would
>>> want
>>>
>>> E[(1-prob)*meanc] taken over u_1 & u_2
>>>
>>> if prob and meanc are independent this is a bit easier as
>>>
>>> E[y] = E[1-prob]E[meanc]
>>>
>>> and the two expectations ony need to taken with repsect to their
>>> respective random effects. If we have sd_1 and sd_2 as the standard
>>> deviations of the two sets of random effects then for the zapoisson:
>>>
>>>    normal.evd<-function(x, mu, v){
>>>       exp(-exp(x))*dnorm(x, mu, sqrt(v))
>>>    }
>>>    normal.zt<-function(x, mu, v){
>>>      exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
>>>    }
>>>
>>>    pred<-function(a_1, a_2, sd_1, sd_2){
>>>      prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),
>>> qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
>>>      meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),
>>> qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
>>>      prob*meanc
>>>    }
>>>
>>> #  gives the expected value with reasonable accuracy.  As an example:
>>>
>>>    x<-rnorm(300)
>>>    l1<-rnorm(300, 1/2+x, sqrt(1))
>>>    l2<-rnorm(300, 1-x, sqrt(1))
>>>
>>>    y<-rbinom(300, 1, 1-exp(-exp(l2)))
>>>    y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,
>>> exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
>>>    # cunning sampler from Peter Dalgaard (R-sig-mixed)
>>>
>>>    data=data.frame(y=y, x=x)
>>>    prior=list(R=list(V=diag(2), nu=0.002, fix=2))
>>>
>>>    m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
>>> family="zapoisson", prior=prior)
>>>
>>>    b_1<-colMeans(m1$Sol)[c(1,3)]
>>>    b_2<-colMeans(m1$Sol)[c(2,4)]
>>>    sd_1<-mean(sqrt(m1$VCV[,1]))
>>>    sd_2<-mean(sqrt(m1$VCV[,2]))
>>>
>>>    # note it is more accurate to take the posterior mean prediction
>>> rather than the prediction from the posterior means as I've done here,
>>> but for illustration:
>>>
>>>    x.pred<-seq(-3,3,length=100)
>>>    p<-1:100
>>>    for(i in 1:100){
>>>      p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =
>>> b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
>>>    }
>>>
>>>    plot(y~x)
>>>    lines(p~x.pred)
>>>
>>> Cheers,
>>>
>>> Jarrod
>>>
>>>
>>>
>>>
>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
>>> 13:33:25 +0000:
>>>
>>> > Dear list,
>>> >
>>> > I've made a reproducible example of the zero-altered prediction,
>>> > in the hope that someone will have a look and reassure me that I'm going
>>> > about this the right way.
>>> > I was a bit confused by the point about p_i and n_i being correlated
>>> (they
>>> > are in my case), but I think this was a red herring for me
>>> > since I'm not deriving the variance analytically.
>>> > The script is here: https://gist.github.com/rubenarslan/
>>> aeacdd306b3d061819a6
>>> > and if you don't want to run the simulation fit yourself, I've put an
>>> RDS
>>> > file of the fit here:
>>> > https://dl.dropboxusercontent.com/u/1078620/m1.rds
>>> >
>>> > Best regards,
>>> >
>>> > Ruben Arslan
>>> >
>>> > On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com>
>>> wrote:
>>> >
>>> >> Dear Dr Bolker,
>>> >>
>>> >> I'd thought about something like this, one point of asking was to
>>> >> see whether
>>> >> a) it's implemented already, because I'll probably make dumb mistakes
>>> while
>>> >> trying b) it's not implemented because it's a bad idea.
>>> >> Your response and the MCMCglmm course notes make me hope that it's c)
>>> not
>>> >> implemented because nobody did yet or d) it's so simple that everybody
>>> does
>>> >> it on-the-fly.
>>> >>
>>> >> So I tried my hand and would appreciate corrections. I am sure there is
>>> >> some screw-up or an inelegant approach in there.
>>> >> I included code for dealing with mcmc.lists because that's what I have
>>> and
>>> >> I'm not entirely sure how I deal with them is correct either.
>>> >>
>>> >> I started with a zero-altered model, because those fit fastest and
>>> >> according to the course notes have the least complex likelihood.
>>> >> Because I know not what I do, I'm not dealing with my random effects at
>>> >> all.
>>> >>
>>> >> I pasted a model summary below to show what I've applied the below
>>> >> function to. The function gives the following quantiles when applied
>>> to 19
>>> >> chains of that model.
>>> >>          5%         50%         95%
>>> >> 5.431684178 5.561211207 5.690655200
>>> >>          5%         50%         95%
>>> >> 5.003974382 5.178192327 5.348246558
>>> >>
>>> >> Warm regards,
>>> >>
>>> >> Ruben Arslan
>>> >>
>>> >> HPDpredict_za = function(object, predictor) {
>>> >>
>>> >> if(class(object) != "MCMCglmm") {
>>> >> if(length( object[[1]]$Residual$nrt )>1) {
>>> >> object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
>>> >> }
>>> >> Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
>>> >> vars = colnames(Sol[[1]])
>>> >> } else {
>>> >> Sol = as.data.frame(object$Sol)
>>> >> vars = names(Sol)
>>> >> }
>>> >> za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
>>> >> "traitza_"]
>>> >> za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
>>> >> "traitza_"]
>>> >>  intercept = Sol[,"(Intercept)"]
>>> >> za_intercept = Sol[, za_intercept_name]
>>> >> l1 = Sol[, predictor ]
>>> >> l2 = Sol[, za_predictor ]
>>> >> if(is.list(object)) {
>>> >> intercept = unlist(intercept)
>>> >> za_intercept = unlist(za_intercept)
>>> >> l1 = unlist(l1)
>>> >> l2 = unlist(l2)
>>> >> }
>>> >>  py_0 = dpois(0, exp(intercept + za_intercept))
>>> >> y_ygt0 = exp(intercept)
>>> >> at_intercept = (1-py_0) * y_ygt0
>>> >>
>>> >> py_0 = dpois(0, exp(intercept + za_intercept + l2))
>>> >> y_ygt0 = exp(intercept +  l1)
>>> >> at_predictor_1 = (1-py_0) * y_ygt0
>>> >> print(qplot(at_intercept))
>>> >> print(qplot(at_predictor_1))
>>> >> df = data.frame("intercept" = at_intercept)
>>> >> df[, predictor] = at_predictor_1
>>> >> print(qplot(x=variable,
>>> >> y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40),
>>> geom =
>>> >> 'violin'))
>>> >> print(quantile(at_intercept, probs = c(0.05,0.5,0.95)))
>>> >> print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95)))
>>> >> invisible(df)
>>> >> }
>>> >>
>>> >>
>>> >> > summary(object[[1]])
>>> >>
>>> >>  Iterations = 100001:299901
>>> >>  Thinning interval  = 100
>>> >>  Sample size  = 2000
>>> >>
>>> >>  DIC: 349094
>>> >>
>>> >>  G-structure:  ~idh(trait):idParents
>>> >>
>>> >>                       post.mean l-95% CI u-95% CI eff.samp
>>> >> children.idParents       0.0189   0.0164   0.0214     1729
>>> >> za_children.idParents    0.2392   0.2171   0.2622     1647
>>> >>
>>> >>  R-structure:  ~idh(trait):units
>>> >>
>>> >>                   post.mean l-95% CI u-95% CI eff.samp
>>> >> children.units        0.144    0.139    0.148     1715
>>> >> za_children.units     1.000    1.000    1.000        0
>>> >>
>>> >>  Location effects: children ~ trait * (maternalage.factor +
>>> paternalloss +
>>> >> maternalloss + center(nr.siblings) + birth.cohort + urban + male +
>>> >> paternalage.mean + paternalage.diff)
>>> >>
>>> >>                                            post.mean  l-95% CI  u-95%
>>> CI
>>> >> eff.samp   pMCMC
>>> >> (Intercept)                                 2.088717  2.073009
>>> 2.103357
>>> >>   2000 <0.0005 ***
>>> >> traitza_children                           -1.933491 -1.981945
>>> -1.887863
>>> >>   2000 <0.0005 ***
>>> >> maternalage.factor(14,20]                   0.007709 -0.014238
>>> 0.027883
>>> >>   1500   0.460
>>> >> maternalage.factor(35,50]                   0.006350 -0.009634
>>> 0.024107
>>> >>   2000   0.462
>>> >> paternallossTRUE                            0.000797 -0.022716
>>> 0.025015
>>> >>   2000   0.925
>>> >> maternallossTRUE                           -0.015542 -0.040240
>>> 0.009549
>>> >>   2000   0.226
>>> >> center(nr.siblings)                         0.005869  0.004302
>>> 0.007510
>>> >>   2000 <0.0005 ***
>>> >> birth.cohort(1703,1722]                    -0.045487 -0.062240
>>> -0.028965
>>> >>   2000 <0.0005 ***
>>> >> birth.cohort(1722,1734]                    -0.055872 -0.072856
>>> -0.036452
>>> >>   2000 <0.0005 ***
>>> >> birth.cohort(1734,1743]                    -0.039770 -0.056580
>>> -0.020907
>>> >>   2000 <0.0005 ***
>>> >> birth.cohort(1743,1750]                    -0.030713 -0.048301
>>> -0.012214
>>> >>   2000   0.002 **
>>> >> urban                                      -0.076748 -0.093240
>>> -0.063002
>>> >>   2567 <0.0005 ***
>>> >> male                                        0.106074  0.095705
>>> 0.115742
>>> >>   2000 <0.0005 ***
>>> >> paternalage.mean                           -0.024119 -0.033133
>>> -0.014444
>>> >>   2000 <0.0005 ***
>>> >> paternalage.diff                           -0.018367 -0.032083
>>> -0.005721
>>> >>   2000   0.007 **
>>> >> traitza_children:maternalage.factor(14,20] -0.116510 -0.182432
>>> -0.051978
>>> >>   1876   0.001 ***
>>> >> traitza_children:maternalage.factor(35,50] -0.045196 -0.094485
>>> 0.002640
>>> >>   2000   0.075 .
>>> >> traitza_children:paternallossTRUE          -0.171957 -0.238218
>>> -0.104820
>>> >>   2000 <0.0005 ***
>>> >> traitza_children:maternallossTRUE          -0.499539 -0.566825
>>> -0.430637
>>> >>   2000 <0.0005 ***
>>> >> traitza_children:center(nr.siblings)       -0.023723 -0.028676
>>> -0.018746
>>> >>   1848 <0.0005 ***
>>> >> traitza_children:birth.cohort(1703,1722]   -0.026012 -0.074250
>>> 0.026024
>>> >>   2000   0.319
>>> >> traitza_children:birth.cohort(1722,1734]   -0.279418 -0.329462
>>> -0.227187
>>> >>   2000 <0.0005 ***
>>> >> traitza_children:birth.cohort(1734,1743]   -0.260165 -0.312659
>>> -0.204462
>>> >>   2130 <0.0005 ***
>>> >> traitza_children:birth.cohort(1743,1750]   -0.481457 -0.534568
>>> -0.426648
>>> >>   2000 <0.0005 ***
>>> >> traitza_children:urban                     -0.604108 -0.645169
>>> -0.562554
>>> >>   1702 <0.0005 ***
>>> >> traitza_children:male                      -0.414988 -0.444589
>>> -0.387005
>>> >>   2000 <0.0005 ***
>>> >> traitza_children:paternalage.mean           0.006545 -0.018570
>>> 0.036227
>>> >>   2000   0.651
>>> >> traitza_children:paternalage.diff          -0.097982 -0.136302
>>> -0.060677
>>> >>   2000 <0.0005 ***
>>> >>
>>> >>
>>> >> On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
>>> >>
>>> >>> Ruben Arslan <rubenarslan at ...> writes:
>>> >>>
>>> >>> >
>>> >>> > Dear list,
>>> >>> >
>>> >>> > I wanted to ask: Is there any (maybe just back of the envelope) way
>>> to
>>> >>> > obtain a response prediction for zero-inflated or hurdle type
>>> models?
>>> >>> > I've fit such models in MCMCglmm, but I don't work in ecology and my
>>> >>> > previous experience with explaining such models to "my audience"
>>> did not
>>> >>> > bode well. When it comes to humans, the researchers I presented to
>>> are
>>> >>> not
>>> >>> > used to offspring count being zero-inflated (or acquainted with that
>>> >>> > concept), but in my historical data with high infant mortality, it
>>> is
>>> >>> (in
>>> >>> > modern data it's actually slightly underdispersed).
>>> >>> >
>>> >>> > Currently I'm using lme4 and simply splitting my models into two
>>> stages
>>> >>> > (finding a mate and having offspring).
>>> >>> > That's okay too, but in one population the effect of interest is not
>>> >>> > clearly visible in either stage, only when both are taken together
>>> (but
>>> >>> > then the outcome is zero-inflated).
>>> >>> > I expect to be given a hard time for this and hence thought I'd use
>>> a
>>> >>> > binomial model with the outcome offspring>0 as my main model, but
>>> that
>>> >>> > turns out to be hard to explain too and doesn't
>>> >>> > really do the data justice.
>>> >>> >
>>> >>> > Basically I don't want to be forced to discuss my smallest
>>> population
>>> >>> as a
>>> >>> > non-replication of the effect because I was insufficiently able to
>>> >>> explain
>>> >>> > the statistics behind my reasoning that the effect shows.
>>> >>>
>>> >>>   I think the back-of-the envelope answer would be that for a
>>> two-stage
>>> >>> model with a prediction of p_i for the probability of having a
>>> non-zero
>>> >>> response (or in the case of zero-inflated models, the probability of
>>> >>> _not_ having a structural zero) and a prediction of n_i for the
>>> >>> conditional
>>> >>> part of the model, the mean predicted value is p_i*n_i and the
>>> >>> variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 +
>>> var(n_i)/n_i^2)
>>> >>> (this is assuming
>>> >>> that you haven't built in any correlation between p_i and n_i, which
>>> >>> would be hard in lme4 but _might_ be possible under certain
>>> circumstances
>>> >>> via a multitype model in MCMCglmm).
>>> >>>
>>> >>>   Does that help?
>>> >>>
>>> >>> _______________________________________________
>>> >>> R-sig-mixed-models at r-project.org mailing list
>>> >>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>> >>>
>>> >>
>>> >
>>> >       [[alternative HTML version deleted]]
>>> >
>>> > _______________________________________________
>>> > R-sig-mixed-models at r-project.org mailing list
>>> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>>
>>>
>


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



More information about the R-sig-mixed-models mailing list