[R-sig-ME] Predictions from zero-inflated or hurdle models

Tue Mar 17 17:31:12 CET 2015

Hi,

Sorry - I should have replied to this post earlier. I've been working  
on predict/simulate methods for all MCMcglmm models (including  
zero-inflated/altered/hurdle/truncated models) and these are now  
largely complete (together with newdata options).

When predictions are to be taken after marginalising the random  
effects (including the `residual' over-dispersion) it is not possible  
to obtain closed form expressions. The options that will be available  
in MCMCglmm are:

1) algebraic approximation
2) numerical integration
3) simulation

1) and 2) are currently only accurate when the random/residual effect  
structure implies no covariance between the Poisson part and the  
binary part.

1) is reasonably accurate for zero-inflated distributions, but can be  
pretty poor for the remainder because they all involve zero-truncated  
Poisson log-normal distributions and my taylor approximation for the  
mean is less than ideal (any suggestions would be helpful).

2) could be extended to double integration in which case covariance  
between the Poisson part and the binary part could be handled.

In your code, part of the problem is that you have fitted a zapoisson,  
but the prediction is based on a zipoisson (with complementary log-log  
link rather than logt link).

In all zero-inflated/altered/hurdle/truncated models

E[y] = E[(1-prob)*meanc]

where prob is the probabilty of a zero in the binary part and meanc is  
the mean of a Poisson distribution (zipoisson) or a zero-truncated  
poisson (zapoisson and hupoisson). If we have eta_1 as the linear  
predictor for the poisson part and eta_2 as the linear predictor for  
the binary part:

In zipoisson: prob = plogis(eta_2)     and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2))  and meanc =  
exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2)     and meanc =  
exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0                 and meanc =  
exp(eta_1)/(1-exp(-exp(eta_1)))

In each case the linear predictor has a `fixed' part and a `random'  
part which I'll denote as `a' and `u' respectively. Ideally we would  
want

E[(1-prob)*meanc] taken over u_1 & u_2

if prob and meanc are independent this is a bit easier as

E[y] = E[1-prob]E[meanc]

and the two expectations ony need to taken with repsect to their  
respective random effects. If we have sd_1 and sd_2 as the standard  
deviations of the two sets of random effects then for the zapoisson:

   normal.evd<-function(x, mu, v){
      exp(-exp(x))*dnorm(x, mu, sqrt(v))
   }
   normal.zt<-function(x, mu, v){
     exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
   }

   pred<-function(a_1, a_2, sd_1, sd_2){
     prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),  
qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
     meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),  
qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
     prob*meanc
   }

#  gives the expected value with reasonable accuracy.  As an example:

   x<-rnorm(300)
   l1<-rnorm(300, 1/2+x, sqrt(1))
   l2<-rnorm(300, 1-x, sqrt(1))

   y<-rbinom(300, 1, 1-exp(-exp(l2)))
   y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,  
exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
   # cunning sampler from Peter Dalgaard (R-sig-mixed)

   data=data.frame(y=y, x=x)
   prior=list(R=list(V=diag(2), nu=0.002, fix=2))

   m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,  
family="zapoisson", prior=prior)

   b_1<-colMeans(m1$Sol)[c(1,3)]
   b_2<-colMeans(m1$Sol)[c(2,4)]
   sd_1<-mean(sqrt(m1$VCV[,1]))
   sd_2<-mean(sqrt(m1$VCV[,2]))

   # note it is more accurate to take the posterior mean prediction  
rather than the prediction from the posterior means as I've done here,  
but for illustration:

   x.pred<-seq(-3,3,length=100)
   p<-1:100
   for(i in 1:100){
     p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =  
b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
   }

   plot(y~x)
   lines(p~x.pred)

Cheers,

Jarrod

Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015  
13:33:25 +0000:

> Dear list,
>
> I've made a reproducible example of the zero-altered prediction,
> in the hope that someone will have a look and reassure me that I'm going
> about this the right way.
> I was a bit confused by the point about p_i and n_i being correlated (they
> are in my case), but I think this was a red herring for me
> since I'm not deriving the variance analytically.
> The script is here: https://gist.github.com/rubenarslan/aeacdd306b3d061819a6
> and if you don't want to run the simulation fit yourself, I've put an RDS
> file of the fit here:
> https://dl.dropboxusercontent.com/u/1078620/m1.rds
>
> Best regards,
>
> Ruben Arslan
>
> On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com> wrote:
>
>> Dear Dr Bolker,
>>
>> I'd thought about something like this, one point of asking was to  
>> see whether
>> a) it's implemented already, because I'll probably make dumb mistakes while
>> trying b) it's not implemented because it's a bad idea.
>> Your response and the MCMCglmm course notes make me hope that it's c) not
>> implemented because nobody did yet or d) it's so simple that everybody does
>> it on-the-fly.
>>
>> So I tried my hand and would appreciate corrections. I am sure there is
>> some screw-up or an inelegant approach in there.
>> I included code for dealing with mcmc.lists because that's what I have and
>> I'm not entirely sure how I deal with them is correct either.
>>
>> I started with a zero-altered model, because those fit fastest and
>> according to the course notes have the least complex likelihood.
>> Because I know not what I do, I'm not dealing with my random effects at
>> all.
>>
>> I pasted a model summary below to show what I've applied the below
>> function to. The function gives the following quantiles when applied to 19
>> chains of that model.
>>          5%         50%         95%
>> 5.431684178 5.561211207 5.690655200
>>          5%         50%         95%
>> 5.003974382 5.178192327 5.348246558
>>
>> Warm regards,
>>
>> Ruben Arslan
>>
>> HPDpredict_za = function(object, predictor) {
>>
>> if(class(object) != "MCMCglmm") {
>> if(length( object[[1]]$Residual$nrt )>1) {
>> object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
>> }
>> Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
>> vars = colnames(Sol[[1]])
>> } else {
>> Sol = as.data.frame(object$Sol)
>> vars = names(Sol)
>> }
>> za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
>> "traitza_"]
>> za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
>> "traitza_"]
>>  intercept = Sol[,"(Intercept)"]
>> za_intercept = Sol[, za_intercept_name]
>> l1 = Sol[, predictor ]
>> l2 = Sol[, za_predictor ]
>> if(is.list(object)) {
>> intercept = unlist(intercept)
>> za_intercept = unlist(za_intercept)
>> l1 = unlist(l1)
>> l2 = unlist(l2)
>> }
>>  py_0 = dpois(0, exp(intercept + za_intercept))
>> y_ygt0 = exp(intercept)
>> at_intercept = (1-py_0) * y_ygt0
>>
>> py_0 = dpois(0, exp(intercept + za_intercept + l2))
>> y_ygt0 = exp(intercept +  l1)
>> at_predictor_1 = (1-py_0) * y_ygt0
>> print(qplot(at_intercept))
>> print(qplot(at_predictor_1))
>> df = data.frame("intercept" = at_intercept)
>> df[, predictor] = at_predictor_1
>> print(qplot(x=variable,
>> y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40), geom =
>> 'violin'))
>> print(quantile(at_intercept, probs = c(0.05,0.5,0.95)))
>> print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95)))
>> invisible(df)
>> }
>>
>>
>> > summary(object[[1]])
>>
>>  Iterations = 100001:299901
>>  Thinning interval  = 100
>>  Sample size  = 2000
>>
>>  DIC: 349094
>>
>>  G-structure:  ~idh(trait):idParents
>>
>>                       post.mean l-95% CI u-95% CI eff.samp
>> children.idParents       0.0189   0.0164   0.0214     1729
>> za_children.idParents    0.2392   0.2171   0.2622     1647
>>
>>  R-structure:  ~idh(trait):units
>>
>>                   post.mean l-95% CI u-95% CI eff.samp
>> children.units        0.144    0.139    0.148     1715
>> za_children.units     1.000    1.000    1.000        0
>>
>>  Location effects: children ~ trait * (maternalage.factor + paternalloss +
>> maternalloss + center(nr.siblings) + birth.cohort + urban + male +
>> paternalage.mean + paternalage.diff)
>>
>>                                            post.mean  l-95% CI  u-95% CI
>> eff.samp   pMCMC
>> (Intercept)                                 2.088717  2.073009  2.103357
>>   2000 <0.0005 ***
>> traitza_children                           -1.933491 -1.981945 -1.887863
>>   2000 <0.0005 ***
>> maternalage.factor(14,20]                   0.007709 -0.014238  0.027883
>>   1500   0.460
>> maternalage.factor(35,50]                   0.006350 -0.009634  0.024107
>>   2000   0.462
>> paternallossTRUE                            0.000797 -0.022716  0.025015
>>   2000   0.925
>> maternallossTRUE                           -0.015542 -0.040240  0.009549
>>   2000   0.226
>> center(nr.siblings)                         0.005869  0.004302  0.007510
>>   2000 <0.0005 ***
>> birth.cohort(1703,1722]                    -0.045487 -0.062240 -0.028965
>>   2000 <0.0005 ***
>> birth.cohort(1722,1734]                    -0.055872 -0.072856 -0.036452
>>   2000 <0.0005 ***
>> birth.cohort(1734,1743]                    -0.039770 -0.056580 -0.020907
>>   2000 <0.0005 ***
>> birth.cohort(1743,1750]                    -0.030713 -0.048301 -0.012214
>>   2000   0.002 **
>> urban                                      -0.076748 -0.093240 -0.063002
>>   2567 <0.0005 ***
>> male                                        0.106074  0.095705  0.115742
>>   2000 <0.0005 ***
>> paternalage.mean                           -0.024119 -0.033133 -0.014444
>>   2000 <0.0005 ***
>> paternalage.diff                           -0.018367 -0.032083 -0.005721
>>   2000   0.007 **
>> traitza_children:maternalage.factor(14,20] -0.116510 -0.182432 -0.051978
>>   1876   0.001 ***
>> traitza_children:maternalage.factor(35,50] -0.045196 -0.094485  0.002640
>>   2000   0.075 .
>> traitza_children:paternallossTRUE          -0.171957 -0.238218 -0.104820
>>   2000 <0.0005 ***
>> traitza_children:maternallossTRUE          -0.499539 -0.566825 -0.430637
>>   2000 <0.0005 ***
>> traitza_children:center(nr.siblings)       -0.023723 -0.028676 -0.018746
>>   1848 <0.0005 ***
>> traitza_children:birth.cohort(1703,1722]   -0.026012 -0.074250  0.026024
>>   2000   0.319
>> traitza_children:birth.cohort(1722,1734]   -0.279418 -0.329462 -0.227187
>>   2000 <0.0005 ***
>> traitza_children:birth.cohort(1734,1743]   -0.260165 -0.312659 -0.204462
>>   2130 <0.0005 ***
>> traitza_children:birth.cohort(1743,1750]   -0.481457 -0.534568 -0.426648
>>   2000 <0.0005 ***
>> traitza_children:urban                     -0.604108 -0.645169 -0.562554
>>   1702 <0.0005 ***
>> traitza_children:male                      -0.414988 -0.444589 -0.387005
>>   2000 <0.0005 ***
>> traitza_children:paternalage.mean           0.006545 -0.018570  0.036227
>>   2000   0.651
>> traitza_children:paternalage.diff          -0.097982 -0.136302 -0.060677
>>   2000 <0.0005 ***
>>
>>
>> On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
>>
>>> Ruben Arslan <rubenarslan at ...> writes:
>>>
>>> >
>>> > Dear list,
>>> >
>>> > I wanted to ask: Is there any (maybe just back of the envelope) way to
>>> > obtain a response prediction for zero-inflated or hurdle type models?
>>> > I've fit such models in MCMCglmm, but I don't work in ecology and my
>>> > previous experience with explaining such models to "my audience" did not
>>> > bode well. When it comes to humans, the researchers I presented to are
>>> not
>>> > used to offspring count being zero-inflated (or acquainted with that
>>> > concept), but in my historical data with high infant mortality, it is
>>> (in
>>> > modern data it's actually slightly underdispersed).
>>> >
>>> > Currently I'm using lme4 and simply splitting my models into two stages
>>> > (finding a mate and having offspring).
>>> > That's okay too, but in one population the effect of interest is not
>>> > clearly visible in either stage, only when both are taken together (but
>>> > then the outcome is zero-inflated).
>>> > I expect to be given a hard time for this and hence thought I'd use a
>>> > binomial model with the outcome offspring>0 as my main model, but that
>>> > turns out to be hard to explain too and doesn't
>>> > really do the data justice.
>>> >
>>> > Basically I don't want to be forced to discuss my smallest population
>>> as a
>>> > non-replication of the effect because I was insufficiently able to
>>> explain
>>> > the statistics behind my reasoning that the effect shows.
>>>
>>>   I think the back-of-the envelope answer would be that for a two-stage
>>> model with a prediction of p_i for the probability of having a non-zero
>>> response (or in the case of zero-inflated models, the probability of
>>> _not_ having a structural zero) and a prediction of n_i for the
>>> conditional
>>> part of the model, the mean predicted value is p_i*n_i and the
>>> variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 + var(n_i)/n_i^2)
>>> (this is assuming
>>> that you haven't built in any correlation between p_i and n_i, which
>>> would be hard in lme4 but _might_ be possible under certain circumstances
>>> via a multitype model in MCMCglmm).
>>>
>>>   Does that help?
>>>
>>> _______________________________________________
>>> R-sig-mixed-models at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
>

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.