eugen pircalabelu eugen_pircalabelu at yahoo.com
Sun Sep 14 17:29:17 CEST 2008

```
Good afternoon,

I need advice on the proper use of glm.nb, zeroinfl or hurdle with my data frame.

I cannot provide a self-contained example, since I need advice on this particular dataset and its “contradictory” results.

So, I have a dataset with 1309 cases and 11 variables. The variables are highly right-skewed and heavily zero-inflated: over 1100 cases have a value of 0 on my variables, both dependent and independent (e.g. variable A has 1220 cases with value 0, variable B has 1283, and so on).

I fitted 3 models (glm.nb, zeroinfl and hurdle) and was expecting “similar” results and similar conclusions.

What was similar was the log-likelihood (very close for all 3 models) and the number of predicted zeros (identical for each model), but the following results surprised me:

- glm.nb identified as influential the same variables that the hurdle model identified in its zero model;

- the zeroinfl model also identified variable d as influential;

Now my question is the following. Having read the vignette (Regression Models for Count Data in R), I noticed that there glm.nb, hurdle and zeroinfl give similar results for the count model, while the zero components of hurdle and zeroinfl may differ somewhat more. In my example, however, the count model from glm.nb is similar to the zero-component part of hurdle and zeroinfl. Why is that? Is the problem that my dataset is extremely zero-inflated, with few cases having values different from 0?

Any kind of help would be most welcome.

Thank you and have a great day ahead.

Call:
hurdle(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) + as.integer(d) + as.integer(e) +
    as.integer(f) + as.integer(g) + as.integer(h), data = dep, dist = "negbin")

Count model coefficients (truncated negbin with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.02178    0.30753  -0.071    0.944
as.integer(a) -0.48886    0.54023  -0.905    0.366
as.integer(b) -0.09555    0.11688  -0.817    0.414
as.integer(c) -0.08654    0.20809  -0.416    0.678
as.integer(d)  0.17446    0.16956   1.029    0.304
as.integer(e)  0.27180    0.55702   0.488    0.626
as.integer(f)  0.15512    0.42721   0.363    0.717
as.integer(g) -0.07687    0.21750  -0.353    0.724
as.integer(h) -0.16906    0.44986  -0.376    0.707
Log(theta)    -0.76274    0.51800  -1.472    0.141

Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.13498    0.07906 -14.356  < 2e-16 ***
as.integer(a) -0.33134    0.30239  -1.096  0.27320
as.integer(b) -0.26394    0.08397  -3.143  0.00167 **
as.integer(c)  0.06689    0.12796   0.523  0.60115
as.integer(d) -0.12045    0.11984  -1.005  0.31486
as.integer(e) -0.79314    0.29106  -2.725  0.00643 **
as.integer(f) -0.28547    0.40790  -0.700  0.48402
as.integer(g) -0.33186    0.18887  -1.757  0.07890 .
as.integer(h) -0.37008    0.31035  -1.192  0.23308
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 0.4664
Number of iterations in BFGS optimization: 28
Log-likelihood: -1073 on 19 Df

Call:
glm.nb(formula = as.integer(x) ~ as.integer(a) +
    as.integer(b) + as.integer(c) + as.integer(d) +
    as.integer(e) + as.integer(f) + as.integer(g) +
    as.integer(h), data = dep, init.theta = 0.187836108765364,

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8607  -0.7236  -0.6809  -0.4610   2.7575

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.56381    0.08820  -6.392 1.64e-10 ***
as.integer(a) -0.51517    0.33477  -1.539  0.12384
as.integer(b) -0.21835    0.07250  -3.011  0.00260 **
as.integer(c)  0.08920    0.14546   0.613  0.53974
as.integer(d) -0.01742    0.10877  -0.160  0.87274
as.integer(e) -0.69085    0.23446  -2.946  0.00321 **
as.integer(f) -0.14182    0.42142  -0.337  0.73647
as.integer(g) -0.24976    0.15819  -1.579  0.11437
as.integer(h) -0.37652    0.30043  -1.253  0.21009
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.1878) family taken to be 1)

    Null deviance: 707.18  on 1308  degrees of freedom
Residual deviance: 677.09  on 1300  degrees of freedom
AIC: 2181.5

Number of Fisher Scoring iterations: 1

              Theta:  0.1878
          Std. Err.:  0.0186
Warning while fitting theta: alternation limit reached

Call:
zeroinfl(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) + as.integer(d) + as.integer(e) +
    as.integer(f) + as.integer(g) + as.integer(h), data = dep, dist = "negbin")

Count model coefficients (negbin with log link):
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.030225   0.237197  -0.127   0.8986
as.integer(a) -0.419544   0.667512  -0.629   0.5297
as.integer(b) -0.128478   0.132001  -0.973   0.3304
as.integer(c) -0.226652   0.146983  -1.542   0.1231
as.integer(d)  0.226577   0.157547   1.438   0.1504
as.integer(e)  0.374845   0.650778   0.576   0.5646
as.integer(f)  0.381320   0.399210   0.955   0.3395
as.integer(g) -0.006804   0.195869  -0.035   0.9723
as.integer(h) -0.161501   0.426027  -0.379   0.7046
Log(theta)    -0.776709   0.393571  -1.973   0.0484 *

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.3705     0.5458  -0.679   0.4973
as.integer(a)   0.1848     1.1336   0.163   0.8705
as.integer(b)   0.2453     0.1775   1.382   0.1669
as.integer(c)  -1.2289     0.8108  -1.516   0.1296
as.integer(d)   0.3749     0.2015   1.861   0.0628 .
as.integer(e)   1.2458     0.4929   2.527   0.0115 *
as.integer(f)   1.1177     0.7105   1.573   0.1157
as.integer(g)   0.5752     0.3332   1.726   0.0843 .
as.integer(h)   0.3890     0.5272   0.738   0.4606
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta = 0.4599
Number of iterations in BFGS optimization: 36
Log-likelihood: -1072 on 19 Df

```
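The comparison described in the post (log-likelihoods and predicted zeros across glm.nb, hurdle and zeroinfl) could be sketched roughly as follows. This is only an illustration under stated assumptions: the poster's `dep` data frame is not available, so the data here are simulated, and the names `sim`, `b`, `e`, `x` and all simulation parameters are made up for the example. It assumes the MASS and pscl packages are installed.

```r
# Sketch only: simulated data stand in for the poster's `dep` data frame.
library(MASS)   # glm.nb
library(pscl)   # hurdle, zeroinfl

set.seed(1)
n   <- 1309
sim <- data.frame(b = rpois(n, 0.3), e = rpois(n, 0.2))

# Hypothetical data-generating process: half the cases are structural
# zeros, the rest follow a negative binomial with a log-linear mean.
mu         <- exp(-0.5 - 0.2 * sim$b - 0.7 * sim$e)
structural <- rbinom(n, 1, 0.5) == 1
sim$x      <- ifelse(structural, 0, rnbinom(n, size = 0.5, mu = mu))

fit_nb   <- glm.nb(x ~ b + e, data = sim)
fit_hur  <- hurdle(x ~ b + e, data = sim, dist = "negbin")
fit_zinb <- zeroinfl(x ~ b + e, data = sim, dist = "negbin")

# Log-likelihoods side by side
loglik <- sapply(list(nb = fit_nb, hurdle = fit_hur, zinb = fit_zinb),
                 function(m) as.numeric(logLik(m)))
print(loglik)

# Expected number of zeros under each model versus the observed count.
# For glm.nb the zero probability is computed from the fitted means and theta;
# for hurdle/zeroinfl the first column of type = "prob" is P(Y = 0).
expected_zeros <- c(
  observed = sum(sim$x == 0),
  nb       = sum(dnbinom(0, size = fit_nb$theta, mu = fitted(fit_nb))),
  hurdle   = sum(predict(fit_hur,  type = "prob")[, 1]),
  zinb     = sum(predict(fit_zinb, type = "prob")[, 1])
)
print(round(expected_zeros, 1))
```

When reading such output side by side, it may help to keep in mind that the binary part of `hurdle` models the probability of a positive count, whereas the zero-inflation part of `zeroinfl` models the probability of an excess zero, so coefficients for the same variable are expected to carry opposite signs in the two zero components.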