[R-sig-eco] zero-truncated Poisson

Stratford, Jeffrey jeffrey.stratford at wilkes.edu
Sun Apr 10 23:00:43 CEST 2011


Thanks Ben and Peter for your thoughtful replies.  This gives me quite a bit to digest, but I think I can now analyze these data.



-----Original Message-----
From: Ben Bolker [mailto:bbolker at gmail.com] 
Sent: Sunday, April 10, 2011 10:38 AM
To: Peter Solymos
Cc: Stratford, Jeffrey; r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] zero-truncated Poisson

On 11-04-09 03:07 AM, Peter Solymos wrote:
> Jeff,
> 
> The zero truncated Poisson distribution can be described as a
> probability function conditional on Y>0:
> 
> Pr(Y=y|Y>0) = Pr(Y=y) / (1-Pr(Y=0)), y=1,2,3,...
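> 
> For example, with a hypothetical rate lambda = 1 (just to illustrate
> the formula), the probability of a count of 1, given that the count
> is positive, is:
> 
> dpois(1, 1) / (1 - dpois(0, 1))  ## exp(-1)/(1 - exp(-1)) ~ 0.582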
> 
> This leads to this function 'nll' for the negative log-likelihood:
> 
> nll <- function(theta, y, X) {
>     lambda <- exp(drop(X %*% theta))  ## log link
>     ## Poisson log-density minus log Pr(Y>0) = log(1 - exp(-lambda))
>     -sum(dpois(y, lambda, log=TRUE) - log(1-exp(-lambda)))
> }
> 
> where theta is the vector of parameters, y is the observation vector,
> X is the design matrix.
> 
> A simple simulation is useful to check that the theory works. I used
> 'optim' with the default Nelder-Mead simplex algorithm; initial values
> are based on non-truncated Poisson estimates to speed up convergence:
> 
> set.seed(101)  ## arbitrary seed, just to make the simulation reproducible
> theta <- c(1, 2)
> X <- model.matrix(~rnorm(100))
> y <- rpois(100,exp(X %*% theta))
> X <- X[y>0,]
> y <- y[y>0]
> inits <- coef(glm(y ~ X-1, family=poisson))
> res <- optim(par=inits, fn=nll, y=y, X=X)
> cbind(theta=theta, theta.hat=res$par)
> 
> Here I used the data you provided; note that 'hessian=TRUE' is set
> for the subsequent SE calculations:
> 
> inits <- coef(glm(taken ~ mass + year, data=blja, family=poisson))
> y <- blja$taken
> X <- model.matrix(~ mass + year, blja)
> res <- optim(par=inits, fn=nll, y=y, X=X, hessian=TRUE)
> 
> Vector of parameters that minimize 'nll' (maximize the log-likelihood)
> given the data:
> 
> cf <- res$par
> 
> Standard errors (note that because we minimized the negative
> log-likelihood, the Hessian that 'optim' returns is already -H, the
> observed information, so it can be inverted directly):
> 
> se <- sqrt(diag(solve(res$hessian)))
> 
> z and p values and the final output table:
> 
> zstat <- cf/se
> pval <- 2 * pnorm(-abs(zstat))
> out <- cbind(cf, se, zstat, pval)
> colnames(out) <- c("Estimate", "Std. Error", "z value", "Pr(>|z|)")
> 
> printCoefmat(out, digits = 4, signif.legend = FALSE)
> 
> For your data set, this gives:
> 
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept) -0.76392    0.10659  -7.167 7.66e-13 ***
> mass         0.24008    0.01262  19.028  < 2e-16 ***
> yeartwo     -0.29006    0.11974  -2.423   0.0154 *
> 
> Cheers,
> 
> Peter
> 
> Péter Sólymos
> Alberta Biodiversity Monitoring Institute
> and Boreal Avian Modelling project
> Department of Biological Sciences
> CW 405, Biological Sciences Bldg
> University of Alberta
> Edmonton, Alberta, T6G 2E9, Canada
> Phone: 780.492.8534
> Fax: 780.492.7635
> email <- paste("solymos", "ualberta.ca", sep = "@")
> http://www.abmi.ca
> http://www.borealbirds.ca
> http://sites.google.com/site/psolymos
> 

  I agree.  I would just like to point out that, if you define the
truncated Poisson distribution function appropriately, you can then do
this very compactly via the mle2 function in the bbmle package.

dtruncpois <- function(x, lambda, log=FALSE) {
  ## zero-truncated Poisson: dpois(x)/(1 - dpois(0)) on the log scale;
  ## a count of zero gets log-density -Inf (probability 0)
  r <- ifelse(x==0, -Inf,
              dpois(x, lambda, log=TRUE) - log(1 - dpois(0, lambda)))
  if (log) r else exp(r)
}
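
  As a quick sanity check (with an arbitrary lambda of 2), the
truncated probabilities should sum to 1 over the positive integers and
put zero mass on a count of 0:

sum(dtruncpois(1:100, lambda=2))  ## ~ 1
dtruncpois(0, lambda=2)           ## 0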

library(bbmle)
(m1 <- mle2(taken ~ dtruncpois(exp(logmu)),
            parameters=list(logmu ~ mass + year),
            data=blja,
            start=list(logmu=1)))
coef(m1)
summary(m1)  ## gives a table almost equivalent to Peter's
confint(m1)  ## profile confidence intervals

  After a bit of digging, it seems that VGAM::vglm fails when it hits an
NaN -- I haven't figured out more than that this happens on the *second*
step of the optimization, not the first. When it works, VGAM should be
faster than the general-purpose ML estimation that Peter and I are
suggesting, but I have fairly often found it not to be very robust.
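
  For reference, a sketch of the failing full-data fit, wrapped in
try() so the error is non-fatal ('fit.fail' is just an illustrative
name; VGAM must be loaded first):

library(VGAM)
## the full-data call that dies with an NaN during optimization
fit.fail <- try(vglm(taken ~ mass + year, family=pospoisson, data=blja))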

 If you plot the data (with a regular Poisson GLM superimposed):

library(ggplot2)
ggplot(blja, aes(x=mass, y=taken, colour=year)) +
  stat_sum(alpha=0.5) +
  ## in current ggplot2, the GLM family is passed via method.args
  geom_smooth(method="glm", method.args=list(family="poisson")) +
  theme_bw()

  You see that there are two observations with very large masses.
It turns out that removing these two observations allows vglm() to work:

glm1 <- vglm(taken ~ mass + year,family=pospoisson,
             data=blja,subset=mass<10)


m2 <- mle2(taken ~ dtruncpois(exp(logmu)),
           parameters=list(logmu ~ mass + year),
           data=subset(blja, mass < 10),
           start=list(logmu=1))

 It also changes the estimate of the mass effect considerably:

confint(m1)
confint(m2)
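
  A compact side-by-side of the two sets of point estimates shows the
same shift (the 'full'/'trimmed' column labels are just for
readability):

cbind(full=coef(m1), trimmed=coef(m2))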

  Ben Bolker

