ziP {mgcv} | R Documentation |

## GAM zero-inflated (hurdle) Poisson regression family

### Description

Family for use with `gam`

or `bam`

, implementing regression for zero inflated Poisson data
when the complimentary log log of the zero probability is linearly dependent on the log of the Poisson parameter. Use with great care, noting that simply having many zero response observations is not an indication of zero inflation: the question is whether you have too many zeroes given the specified model.

This sort of model is really only appropriate when none of your covariates help to explain the zeroes in your data. If your covariates predict which observations are likely to have zero mean then adding a zero inflated model on top of this is likely to lead to identifiability problems. Identifiability problems may lead to fit failures, or absurd values for the linear predictor or predicted values.

### Usage

```
ziP(theta = NULL, link = "identity",b=0)
```

### Arguments

`theta` |
the 2 parameters controlling the slope and intercept of the
linear transform of the mean controlling the zero inflation rate. If supplied
then treated as fixed parameters ( |

`link` |
The link function: only the |

`b` |
a non-negative constant, specifying the minimum dependence of the zero inflation rate on the linear predictor. |

### Details

The probability of a zero count is given by `1-p`

, whereas the probability of
count `y>0`

is given by the truncated Poisson probability function `p\mu^y/((\exp(\mu)-1)y!)`

. The linear predictor
gives `\log \mu`

, while `\eta = \log(-\log(1-p)) `

and `\eta = \theta_1 + \{b+\exp(\theta_2)\} \log \mu `

. The `theta`

parameters are estimated alongside the smoothing parameters. Increasing the `b`

parameter from zero can greatly reduce identifiability problems, particularly when there are very few non-zero data.

The fitted values for this model are the log of the Poisson parameter. Use the `predict`

function with `type=="response"`

to get the predicted expected response. Note that the theta parameters reported in model summaries are `\theta_1`

and `b + \exp(\theta_2)`

.

These models should be subject to very careful checking, especially if fitting has not converged. It is quite easy to set up models with identifiability problems, particularly if the data are not really zero inflated, but simply have many zeroes because the mean is very low in some parts of the covariate space. See example for some obvious checks. Take convergence warnings seriously.

### Value

An object of class `extended.family`

.

### WARNINGS

Zero inflated models are often over-used. Having lots of zeroes in the data does not in itself imply zero inflation. Having too many zeroes *given the model mean* may imply zero inflation.

### Author(s)

Simon N. Wood simon.wood@r-project.org

### References

Wood, S.N., N. Pya and B. Saefken (2016), Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association 111, 1548-1575 doi:10.1080/01621459.2016.1180986

### See Also

### Examples

```
rzip <- function(gamma,theta= c(-2,.3)) {
## generate zero inflated Poisson random variables, where
## lambda = exp(gamma), eta = theta[1] + exp(theta[2])*gamma
## and 1-p = exp(-exp(eta)).
y <- gamma; n <- length(y)
lambda <- exp(gamma)
eta <- theta[1] + exp(theta[2])*gamma
p <- 1- exp(-exp(eta))
ind <- p > runif(n)
y[!ind] <- 0
np <- sum(ind)
## generate from zero truncated Poisson, given presence...
y[ind] <- qpois(runif(np,dpois(0,lambda[ind]),1),lambda[ind])
y
}
library(mgcv)
## Simulate some ziP data...
set.seed(1);n<-400
dat <- gamSim(1,n=n)
dat$y <- rzip(dat$f/4-1)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=ziP(),data=dat)
b$outer.info ## check convergence!!
b
plot(b,pages=1)
plot(b,pages=1,unconditional=TRUE) ## add s.p. uncertainty
gam.check(b)
## more checking...
## 1. If the zero inflation rate becomes decoupled from the linear predictor,
## it is possible for the linear predictor to be almost unbounded in regions
## containing many zeroes. So examine if the range of predicted values
## is sane for the zero cases?
range(predict(b,type="response")[b$y==0])
## 2. Further plots...
par(mfrow=c(2,2))
plot(predict(b,type="response"),residuals(b))
plot(predict(b,type="response"),b$y);abline(0,1,col=2)
plot(b$linear.predictors,b$y)
qq.gam(b,rep=20,level=1)
## 3. Refit fixing the theta parameters at their estimated values, to check we
## get essentially the same fit...
thb <- b$family$getTheta()
b0 <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=ziP(theta=thb),data=dat)
b;b0
## Example fit forcing minimum linkage of prob present and
## linear predictor. Can fix some identifiability problems.
b2 <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=ziP(b=.3),data=dat)
```

*mgcv*version 1.9-0 Index]