[R] Question on glm.nb vs zeroinfl vs hurdle models
eugen pircalabelu
eugen_pircalabelu at yahoo.com
Sun Sep 14 17:29:17 CEST 2008
Good afternoon,
I need some advice regarding the proper use of glm.nb, zeroinfl or hurdle with my data frame.
I cannot provide a self-contained example, since I need advice on this particular dataset and its “contradictory” results.
So, I have a dataset with 1309 cases and 11 variables, highly right-skewed and heavily zero-inflated (over 1100 cases have the value 0 on my variables, both dependent and independent; e.g., variable A has 1220 cases with value 0, variable B has 1283, and so on).
I tried to fit 3 models: glm.nb, zeroinfl and hurdle and I was expecting some “similar” results and similar conclusions.
What was similar was the log-likelihood (very close for all 3 models) and the number of predicted zeros (identical for each model), but what surprised me were the following results:
- glm.nb identified as influential the same variables that the hurdle model identified in its zero component;
- the zeroinfl model also identified variable d as influential.
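For reference, the kind of comparison described above can be sketched roughly as follows. This is only a minimal sketch on simulated stand-in data, not my real data frame: the data frame sim, its columns x, b and e, and all coefficient values used in the simulation are made up for illustration. It assumes the MASS and pscl packages are installed.

```r
# Sketch only: simulated stand-in data, NOT the real 'dep' data frame.
library(MASS)  # glm.nb()
library(pscl)  # hurdle(), zeroinfl()

set.seed(1)
n <- 1309
sim <- data.frame(b = rpois(n, 0.5), e = rbinom(n, 1, 0.3))

# Hypothetical data-generating process: a structural-zero indicator
# mixed with negative binomial counts (all coefficients are invented).
structural <- rbinom(n, 1, plogis(0.5 + 1.0 * sim$e))
counts     <- rnbinom(n, size = 1, mu = exp(0.5 - 0.3 * sim$b))
sim$x      <- ifelse(structural == 1, 0L, counts)

# The three models, each with the same regressors
fit_nb <- glm.nb(x ~ b + e, data = sim)
fit_hu <- hurdle(x ~ b + e, data = sim, dist = "negbin")
fit_zi <- zeroinfl(x ~ b + e, data = sim, dist = "negbin")

# Log-likelihoods side by side
ll <- sapply(list(nb = fit_nb, hurdle = fit_hu, zeroinfl = fit_zi), logLik)
print(ll)

# Observed zeros versus the expected number of zeros under each model
zeros <- c(
  observed = sum(sim$x == 0),
  nb       = sum(dnbinom(0, mu = fitted(fit_nb), size = fit_nb$theta)),
  hurdle   = sum(predict(fit_hu, type = "prob")[, 1]),
  zeroinfl = sum(predict(fit_zi, type = "prob")[, 1])
)
print(round(zeros))
```

To look only at the zero components, coef(fit_hu, model = "zero") and coef(fit_zi, model = "zero") can be compared directly.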
Now my question is the following. Having read the vignette (Regression Models for Count Data in R), I noticed that there glm.nb, hurdle and zeroinfl give similar results for the count model, while the zero components of hurdle and zeroinfl may differ somewhat more. In my example, however, the count model from glm.nb is similar to the zero component of hurdle and zeroinfl. Why is that? Is the problem that my dataset is extremely zero-inflated, with only a few cases that have values different from 0?
Any kind of help would be most welcome.
Thank you and have a great day ahead.
> summary(aaa)
Call:
hurdle(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) + as.integer(d) + as.integer(e) +
as.integer(f) + as.integer(g) + as.integer(h), data = dep, dist = "negbin")
Count model coefficients (truncated negbin with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.02178    0.30753  -0.071    0.944
as.integer(a) -0.48886    0.54023  -0.905    0.366
as.integer(b) -0.09555    0.11688  -0.817    0.414
as.integer(c) -0.08654    0.20809  -0.416    0.678
as.integer(d)  0.17446    0.16956   1.029    0.304
as.integer(e)  0.27180    0.55702   0.488    0.626
as.integer(f)  0.15512    0.42721   0.363    0.717
as.integer(g) -0.07687    0.21750  -0.353    0.724
as.integer(h) -0.16906    0.44986  -0.376    0.707
Log(theta)    -0.76274    0.51800  -1.472    0.141
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.13498    0.07906 -14.356  < 2e-16 ***
as.integer(a) -0.33134    0.30239  -1.096  0.27320
as.integer(b) -0.26394    0.08397  -3.143  0.00167 **
as.integer(c)  0.06689    0.12796   0.523  0.60115
as.integer(d) -0.12045    0.11984  -1.005  0.31486
as.integer(e) -0.79314    0.29106  -2.725  0.00643 **
as.integer(f) -0.28547    0.40790  -0.700  0.48402
as.integer(g) -0.33186    0.18887  -1.757  0.07890 .
as.integer(h) -0.37008    0.31035  -1.192  0.23308
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.4664
Number of iterations in BFGS optimization: 28
Log-likelihood: -1073 on 19 Df
> summary(a)
Call:
glm.nb(formula = as.integer(x) ~ as.integer(a) +
as.integer(b) + as.integer(c) + as.integer(d) +
as.integer(e) + as.integer(f) + as.integer(g) +
as.integer(h), data = dep, init.theta = 0.187836108765364,
link = log)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8607  -0.7236  -0.6809  -0.4610   2.7575
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.56381    0.08820  -6.392 1.64e-10 ***
as.integer(a) -0.51517    0.33477  -1.539  0.12384
as.integer(b) -0.21835    0.07250  -3.011  0.00260 **
as.integer(c)  0.08920    0.14546   0.613  0.53974
as.integer(d) -0.01742    0.10877  -0.160  0.87274
as.integer(e) -0.69085    0.23446  -2.946  0.00321 **
as.integer(f) -0.14182    0.42142  -0.337  0.73647
as.integer(g) -0.24976    0.15819  -1.579  0.11437
as.integer(h) -0.37652    0.30043  -1.253  0.21009
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.1878) family taken to be 1)
Null deviance: 707.18 on 1308 degrees of freedom
Residual deviance: 677.09 on 1300 degrees of freedom
AIC: 2181.5
Number of Fisher Scoring iterations: 1
Theta: 0.1878
Std. Err.: 0.0186
Warning while fitting theta: alternation limit reached
> summary(aa)
Call:
zeroinfl(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) + as.integer(d) + as.integer(e) +
as.integer(f) + as.integer(g) + as.integer(h), data = dep, dist = "negbin")
Count model coefficients (negbin with log link):
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.030225   0.237197  -0.127   0.8986
as.integer(a) -0.419544   0.667512  -0.629   0.5297
as.integer(b) -0.128478   0.132001  -0.973   0.3304
as.integer(c) -0.226652   0.146983  -1.542   0.1231
as.integer(d)  0.226577   0.157547   1.438   0.1504
as.integer(e)  0.374845   0.650778   0.576   0.5646
as.integer(f)  0.381320   0.399210   0.955   0.3395
as.integer(g) -0.006804   0.195869  -0.035   0.9723
as.integer(h) -0.161501   0.426027  -0.379   0.7046
Log(theta)    -0.776709   0.393571  -1.973   0.0484 *
Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.3705     0.5458  -0.679   0.4973
as.integer(a)   0.1848     1.1336   0.163   0.8705
as.integer(b)   0.2453     0.1775   1.382   0.1669
as.integer(c)  -1.2289     0.8108  -1.516   0.1296
as.integer(d)   0.3749     0.2015   1.861   0.0628 .
as.integer(e)   1.2458     0.4929   2.527   0.0115 *
as.integer(f)   1.1177     0.7105   1.573   0.1157
as.integer(g)   0.5752     0.3332   1.726   0.0843 .
as.integer(h)   0.3890     0.5272   0.738   0.4606
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 0.4599
Number of iterations in BFGS optimization: 36
Log-likelihood: -1072 on 19 Df