ziP {mgcv} | R Documentation |
GAM zero-inflated (hurdle) Poisson regression family
Description
Family for use with gam
or bam
, implementing regression for zero inflated Poisson data
when the complimentary log log of the zero probability is linearly dependent on the log of the Poisson parameter. Use with great care, noting that simply having many zero response observations is not an indication of zero inflation: the question is whether you have too many zeroes given the specified model.
This sort of model is really only appropriate when none of your covariates help to explain the zeroes in your data. If your covariates predict which observations are likely to have zero mean then adding a zero inflated model on top of this is likely to lead to identifiability problems. Identifiability problems may lead to fit failures, or absurd values for the linear predictor or predicted values.
Usage
ziP(theta = NULL, link = "identity",b=0)
Arguments
theta |
the 2 parameters controlling the slope and intercept of the
linear transform of the mean controlling the zero inflation rate. If supplied
then treated as fixed parameters ( |
link |
The link function: only the |
b |
a non-negative constant, specifying the minimum dependence of the zero inflation rate on the linear predictor. |
Details
The probability of a zero count is given by 1-p
, whereas the probability of
count y>0
is given by the truncated Poisson probability function p\mu^y/((\exp(\mu)-1)y!)
. The linear predictor
gives \log \mu
, while \eta = \log(-\log(1-p))
and \eta = \theta_1 + \{b+\exp(\theta_2)\} \log \mu
. The theta
parameters are estimated alongside the smoothing parameters. Increasing the b
parameter from zero can greatly reduce identifiability problems, particularly when there are very few non-zero data.
The fitted values for this model are the log of the Poisson parameter. Use the predict
function with type=="response"
to get the predicted expected response. Note that the theta parameters reported in model summaries are \theta_1
and b + \exp(\theta_2)
.
These models should be subject to very careful checking, especially if fitting has not converged. It is quite easy to set up models with identifiability problems, particularly if the data are not really zero inflated, but simply have many zeroes because the mean is very low in some parts of the covariate space. See example for some obvious checks. Take convergence warnings seriously.
Value
An object of class extended.family
.
WARNINGS
Zero inflated models are often over-used. Having lots of zeroes in the data does not in itself imply zero inflation. Having too many zeroes *given the model mean* may imply zero inflation.
Author(s)
Simon N. Wood simon.wood@r-project.org
References
Wood, S.N., N. Pya and B. Saefken (2016), Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association 111, 1548-1575 doi:10.1080/01621459.2016.1180986
See Also
Examples
rzip <- function(gamma,theta= c(-2,.3)) {
## generate zero inflated Poisson random variables, where
## lambda = exp(gamma), eta = theta[1] + exp(theta[2])*gamma
## and 1-p = exp(-exp(eta)).
y <- gamma; n <- length(y)
lambda <- exp(gamma)
eta <- theta[1] + exp(theta[2])*gamma
p <- 1- exp(-exp(eta))
ind <- p > runif(n)
y[!ind] <- 0
np <- sum(ind)
## generate from zero truncated Poisson, given presence...
y[ind] <- qpois(runif(np,dpois(0,lambda[ind]),1),lambda[ind])
y
}
library(mgcv)
## Simulate some ziP data...
set.seed(1);n<-400
dat <- gamSim(1,n=n)
dat$y <- rzip(dat$f/4-1)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=ziP(),data=dat)
b$outer.info ## check convergence!!
b
plot(b,pages=1)
plot(b,pages=1,unconditional=TRUE) ## add s.p. uncertainty
gam.check(b)
## more checking...
## 1. If the zero inflation rate becomes decoupled from the linear predictor,
## it is possible for the linear predictor to be almost unbounded in regions
## containing many zeroes. So examine if the range of predicted values
## is sane for the zero cases?
range(predict(b,type="response")[b$y==0])
## 2. Further plots...
par(mfrow=c(2,2))
plot(predict(b,type="response"),residuals(b))
plot(predict(b,type="response"),b$y);abline(0,1,col=2)
plot(b$linear.predictors,b$y)
qq.gam(b,rep=20,level=1)
## 3. Refit fixing the theta parameters at their estimated values, to check we
## get essentially the same fit...
thb <- b$family$getTheta()
b0 <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=ziP(theta=thb),data=dat)
b;b0
## Example fit forcing minimum linkage of prob present and
## linear predictor. Can fix some identifiability problems.
b2 <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=ziP(b=.3),data=dat)