[R] Question on glm.nb vs zeroinfl vs hurdle models

Sun Sep 14 17:29:17 CEST 2008

Good afternoon, 

I need some advice on the proper use of glm.nb, zeroinfl or hurdle with my data frame. 

I cannot provide a self-contained example, since I need advice on this particular dataset and its “contradictory” results. 

I have a dataset with 1309 cases and 11 variables. The variables are highly right-skewed and heavily zero-inflated: more than 1100 cases have the value 0 for my variables, both dependent and independent (e.g. variable A is 0 in 1220 cases, variable B is 0 in 1283 cases, and so on). 
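For reference, the zero counts quoted above can be tabulated along the following lines. This is only a sketch on simulated data: `dep_sim` and its columns are made-up stand-ins, not my real data frame `dep`.

```r
# Simulated stand-in for the real data frame (names and values are hypothetical)
set.seed(42)
dep_sim <- data.frame(
  x = rnbinom(200, size = 0.2, mu = 0.5),  # skewed count response
  a = rbinom(200, 1, 0.05),                # rare binary predictor
  b = rpois(200, 0.1)                      # sparse count predictor
)

# Number and share of zeros in each column
zero_counts <- colSums(dep_sim == 0)
zero_share  <- colMeans(dep_sim == 0)

# Number of cases in which every variable is zero
all_zero_rows <- sum(rowSums(dep_sim != 0) == 0)

print(rbind(zero_counts, zero_share = round(zero_share, 3)))
print(all_zero_rows)
```
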

I fitted three models (glm.nb, zeroinfl and hurdle) and expected broadly similar results and conclusions. 

The log-likelihoods were similar (very close for all three models), and so was the number of predicted zeros (identical for each model). What surprised me were the following results: 

- glm.nb flagged as influential the same variables that the hurdle model flagged in its zero component; 

- the zeroinfl model additionally flagged variable d as influential. 
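For context, the kind of comparison described here can be sketched as follows. The data are simulated and the names `sim`, `y` and `x1` are made up for illustration, so this shows the workflow only and does not reproduce my results. It assumes the MASS and pscl packages are installed.

```r
library(MASS)  # glm.nb
library(pscl)  # zeroinfl, hurdle

# Simulated zero-inflated negative binomial data (illustrative only)
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
structural_zero <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1))
counts <- rnbinom(n, size = 1, mu = exp(0.5 + 0.3 * x1))
sim <- data.frame(y = ifelse(structural_zero == 1, 0, counts), x1 = x1)

fit_nb     <- glm.nb(y ~ x1, data = sim)
fit_zinb   <- zeroinfl(y ~ x1, data = sim, dist = "negbin")
fit_hurdle <- hurdle(y ~ x1, data = sim, dist = "negbin")

# Log-likelihoods of the three fits
loglik <- c(nb     = as.numeric(logLik(fit_nb)),
            zinb   = as.numeric(logLik(fit_zinb)),
            hurdle = as.numeric(logLik(fit_hurdle)))

# Expected number of zeros: sum over cases of the fitted P(y = 0)
expected_zeros <- c(
  observed = sum(sim$y == 0),
  nb       = sum(dnbinom(0, mu = fitted(fit_nb), size = fit_nb$theta)),
  zinb     = sum(predict(fit_zinb, type = "prob")[, 1]),
  hurdle   = sum(predict(fit_hurdle, type = "prob")[, 1])
)

print(round(loglik, 1))
print(round(expected_zeros))
```
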

Now my question is the following. Having read the vignette (Regression Models for Count Data in R), I noticed that there glm.nb, hurdle and zeroinfl give similar results for the count model, while the zero components of hurdle and zeroinfl may differ somewhat more. In my example, however, the coefficients from glm.nb resemble the zero component of hurdle and zeroinfl rather than their count components. Why is that? Is the problem that my dataset is extremely zero-inflated, with only a few cases having non-zero values? 

Any kind of help would be most welcome. 
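One thing that may matter when reading the zero components side by side: as far as I understand the pscl documentation, the hurdle zero part models the probability of a positive count, whereas the zeroinfl zero part models the probability of an excess zero, so corresponding coefficients would be expected to carry opposite signs. The sketch below uses plain logistic regressions on simulated data to show that switching the modelled event flips the signs. The names `sim`, `positive` and `zero` are made up, and these are not the hurdle or zeroinfl fits themselves.

```r
# Simulated counts (illustrative only, not the real data)
set.seed(123)
n  <- 500
x1 <- rnorm(n)
y  <- rpois(n, exp(-1 + 0.7 * x1))
sim <- data.frame(x1 = x1,
                  positive = as.integer(y > 0),   # event: count is positive
                  zero     = as.integer(y == 0))  # event: count is zero

fit_positive <- glm(positive ~ x1, family = binomial, data = sim)
fit_zero     <- glm(zero ~ x1, family = binomial, data = sim)

# Same magnitudes, opposite signs
coef_table <- cbind(positive = coef(fit_positive), zero = coef(fit_zero))
print(round(coef_table, 4))
```
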

Thank you and have a great day ahead. 

> summary(aaa) 

Call: 
hurdle(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) + as.integer(d) + as.integer(e) + 
    as.integer(f) + as.integer(g) + as.integer(h), data = dep, dist = "negbin") 

Count model coefficients (truncated negbin with log link): 
              Estimate Std. Error z value Pr(>|z|) 
(Intercept)   -0.02178    0.30753  -0.071    0.944 
as.integer(a) -0.48886    0.54023  -0.905    0.366 
as.integer(b) -0.09555    0.11688  -0.817    0.414 
as.integer(c) -0.08654    0.20809  -0.416    0.678 
as.integer(d)  0.17446    0.16956   1.029    0.304 
as.integer(e)  0.27180    0.55702   0.488    0.626 
as.integer(f)  0.15512    0.42721   0.363    0.717 
as.integer(g) -0.07687    0.21750  -0.353    0.724 
as.integer(h) -0.16906    0.44986  -0.376    0.707 
Log(theta)    -0.76274    0.51800  -1.472    0.141 

Zero hurdle model coefficients (binomial with logit link): 
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.13498    0.07906 -14.356  < 2e-16 *** 
as.integer(a) -0.33134    0.30239  -1.096  0.27320    
as.integer(b) -0.26394    0.08397  -3.143  0.00167 ** 
as.integer(c)  0.06689    0.12796   0.523  0.60115    
as.integer(d) -0.12045    0.11984  -1.005  0.31486    
as.integer(e) -0.79314    0.29106  -2.725  0.00643 ** 
as.integer(f) -0.28547    0.40790  -0.700  0.48402    
as.integer(g) -0.33186    0.18887  -1.757  0.07890 .   
as.integer(h) -0.37008    0.31035  -1.192  0.23308    
--- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 0.4664 
Number of iterations in BFGS optimization: 28 
Log-likelihood: -1073 on 19 Df 

> summary(a) 

Call: 
glm.nb(formula = as.integer(x) ~ as.integer(a) + 
    as.integer(b) + as.integer(c) + as.integer(d) + 
    as.integer(e) + as.integer(f) + as.integer(g) + 
    as.integer(h), data = dep, init.theta = 0.187836108765364, 
    link = log) 

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8607  -0.7236  -0.6809  -0.4610   2.7575  

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.56381    0.08820  -6.392 1.64e-10 *** 
as.integer(a) -0.51517    0.33477  -1.539  0.12384    
as.integer(b) -0.21835    0.07250  -3.011  0.00260 ** 
as.integer(c)  0.08920    0.14546   0.613  0.53974    
as.integer(d) -0.01742    0.10877  -0.160  0.87274    
as.integer(e) -0.69085    0.23446  -2.946  0.00321 ** 
as.integer(f) -0.14182    0.42142  -0.337  0.73647    
as.integer(g) -0.24976    0.15819  -1.579  0.11437    
as.integer(h) -0.37652    0.30043  -1.253  0.21009    
--- 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Negative Binomial(0.1878) family taken to be 1) 

    Null deviance: 707.18  on 1308  degrees of freedom 
Residual deviance: 677.09  on 1300  degrees of freedom 
AIC: 2181.5 

Number of Fisher Scoring iterations: 1 

              Theta:  0.1878 
          Std. Err.:  0.0186 
Warning while fitting theta: alternation limit reached 

> summary(aa) 

Call: 
zeroinfl(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) + as.integer(d) + as.integer(e) + 
    as.integer(f) + as.integer(g) + as.integer(h), data = dep, dist = "negbin") 

Count model coefficients (negbin with log link): 
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)   -0.030225   0.237197  -0.127   0.8986  
as.integer(a) -0.419544   0.667512  -0.629   0.5297  
as.integer(b) -0.128478   0.132001  -0.973   0.3304  
as.integer(c) -0.226652   0.146983  -1.542   0.1231  
as.integer(d)  0.226577   0.157547   1.438   0.1504  
as.integer(e)  0.374845   0.650778   0.576   0.5646  
as.integer(f)  0.381320   0.399210   0.955   0.3395  
as.integer(g) -0.006804   0.195869  -0.035   0.9723  
as.integer(h) -0.161501   0.426027  -0.379   0.7046  
Log(theta)    -0.776709   0.393571  -1.973   0.0484 * 

Zero-inflation model coefficients (binomial with logit link): 
              Estimate Std. Error z value Pr(>|z|)  
(Intercept)    -0.3705     0.5458  -0.679   0.4973  
as.integer(a)   0.1848     1.1336   0.163   0.8705  
as.integer(b)   0.2453     0.1775   1.382   0.1669  
as.integer(c)  -1.2289     0.8108  -1.516   0.1296  
as.integer(d)   0.3749     0.2015   1.861   0.0628 . 
as.integer(e)   1.2458     0.4929   2.527   0.0115 * 
as.integer(f)   1.1177     0.7105   1.573   0.1157  
as.integer(g)   0.5752     0.3332   1.726   0.0843 . 
as.integer(h)   0.3890     0.5272   0.738   0.4606  
--- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 0.4599 
Number of iterations in BFGS optimization: 36 
Log-likelihood: -1072 on 19 Df