[R] pscl package and hurdle model marginal effects
Achim Zeileis
Achim.Zeileis at uibk.ac.at
Wed Jan 18 19:59:56 CET 2012
Travis,
thanks for the follow-up. As the reply got somewhat lengthier and includes
a worked example, I include the R-help mailing list again. Maybe the reply
is useful to someone else in the future.
> Thanks for the reply. I think you cleared it up, but I would like to
> follow up to be certain. I also may have mixed some terminology in my
> original question. I have been following Greene's econometrics text, but
> I also have your Applied Econometrics with R that I'll reference for
> this question.
Yes, good idea. Let's take the RecreationDemand data, that's simple. And
before using confusing terminology, let's look at a practical application,
that's often more illuminating. Let's use the hurdle model for the number
of recreational boat trips:
## packages and data
library("pscl")
data("RecreationDemand", package = "AER")
## model
m <- hurdle(trips ~ . | quality + income,
data = RecreationDemand, dist = "negbin")
As you say below, we get a coefficient for "quality" of 0.17 and 1.5
respectively. The first one means that the _odds_ (not the probabilities)
of observing a non-zero number of trips increase strongly with quality:
about 350% per unit of quality = exp(1.5) - 1. This is still not too
intuitive due to the usage of odds, but we'll get to probabilities later.
And the 0.17 mean that the expected number of trips in the count component
increase by 18% = exp(0.17) - 1 per unity of quality.
But we can also make this even easier to interpret by looking at the
_effects_ (not the marginal effects) of quality for a "typical" individual
with average income, average costs, etc.
## new data with quality = 0, ..., 5 and "typical" values
## for all other variables
nd <- data.frame(quality = 0:5, ski = "no", userfee = "no",
income = mean(RecreationDemand$income),
costC = mean(RecreationDemand$costC),
costS = mean(RecreationDemand$costS),
costH = mean(RecreationDemand$costH))
## predicted probability for non-zero trips
nonzero <- 1 - predict(m, newdata = nd, type = "prob")[, 1]
## visualize
plot(0:5, nonzero, type = "b")
## slopes
diff(nonzero)
So we see that if this "typical" individual had a quality rating of zero,
the probability of taking any trips at all would only be 5%. While for the
same "typical" individual with quality rating of 5, the probability of
taking at least one trip would be 99%. Also we see that the influence is
clearly non-linear and S-shaped. The slope varies between only 3.4
percentage points to up to 32 percentage points!
Similarly, we can compute the predicted mean from the count component of
the model (untruncated):
## predicted mean from (untruncated) count component
countmean <- predict(m, newdata = nd, type = "count")
plot(0:5, countmean, type = "b")
This is also not linear, but not as strongly as for the binary part.
Finally, we can also combine both model parts to the overall response:
## predicted overall mean (combining both model parts)
allmean <- predict(m, newdata = nd, type = "response")
plot(0:5, allmean, type = "b")
which is still rather S-shaped.
These curves are sometimes also referred to as "effects" and I find them
rather easy to interpret. For each of these curves you can also look at
the "marginal effects", i.e., the average slope.
So, if we want to look at the marginal effect of "quality" in the binary
hurdle part, we can use Equation 5.2 from our AER book or equivalently
Equation 21-10 from Greene's book (5th ed).
## average regressor
x <- c(1, mean(RecreationDemand$quality), mean(RecreationDemand$income))
## coefficients
b <- coef(m, model = "zero")
## linear predictor x'b
xb <- sum(x * b)
## marginal effect
dlogis(xb) * b["quality"]
As the average quality is 1.42, it is not surprising that this slope is
similar to the diff(nonzero)[2] (i.e., going from quality=1 to quality=2):
an increase of about 32 percentage points. But we have seen above that the
slope at quality=4 would be very different from this!
With similar arguments you could compute average slopes for the count mean
or the overall mean. I don't do it here because I find the _effects_
computed above much more useful than the _marginal effects_. However, as
econometricians prefer numbers to graphics, you don't see many effects
plots in econometric works. Also, Stata computes marginal effects for just
about everything - no matter whether it is useful or not.
For standard lm and glm as well as multinom and polr objects, John Fox's
"effects" package can also produce very nice effect displays
automatically. It does not work for hurdle objects, though.
> On pages 140 - 141 of your text you run a hurdle model for recreational
> trips. In the count model the coefficient on QUALITY is 0.1717, and in
> the zero hurdle model the coefficient on quality is 1.5029. Let us
> pretend that QUALITY is the only independent variable. It is my
> understanding that neither 0.1717 or 1.5029 are marginal effects, i.e.,
> we cannot say a one percent increase in quality would cause a 0.1717
> increase in trips. Because the model is nonlinear, the marginal effects
> will vary with the value of the covariates. Greene suggests evaluating
> the expression at 1) the sample means of the data or 2) "evaluate the
> marginal effects at every observation and use the sample average of the
> individual marginal effects" (pg 668 in 5th edition).
Yes, but this is only for binary regressions. There it is rather clear
that you want to track changes in the mean i.e. in the probability.
However for hurdle models, there are various quantities you could be
interested in: probability for non-zero count, untruncated count mean,
truncated count mean, overall mean, etc. That's why there is no single
marginal effect of interest.
Hope this clarifies some of the previous issues!
Best regards,
Achim
More information about the R-help
mailing list