[R] pscl package and hurdle model marginal effects

Achim Zeileis Achim.Zeileis at uibk.ac.at
Wed Jan 18 19:59:56 CET 2012


Travis,

thanks for the follow-up. As the reply got somewhat lengthier and includes 
a worked example, I include the R-help mailing list again. Maybe the reply 
is useful to someone else in the future.

> Thanks for the reply. I think you cleared it up, but I would like to 
> follow up to be certain. I also may have mixed some terminology in my 
> original question. I have been following Greene's econometrics text, but 
> I also have your Applied Econometrics with R that I'll reference for 
> this question.

Yes, good idea. Let's take the RecreationDemand data, that's simple. And 
before using confusing terminology, let's look at a practical application, 
that's often more illuminating. Let's use the hurdle model for the number 
of recreational boat trips:

## packages and data
library("pscl")
data("RecreationDemand", package = "AER")

## model
m <- hurdle(trips ~ . | quality + income,
   data = RecreationDemand, dist = "negbin")

As you say below, we get a coefficient for "quality" of 0.17 and 1.5 
respectively. The first one means that the _odds_ (not the probabilities) 
of observing a non-zero number of trips increase strongly with quality: 
about 350% per unit of quality = exp(1.5) - 1. This is still not too 
intuitive due to the usage of odds, but we'll get to probabilities later. 
And the 0.17 mean that the expected number of trips in the count component 
increase by 18% = exp(0.17) - 1 per unity of quality.

But we can also make this even easier to interpret by looking at the 
_effects_ (not the marginal effects) of quality for a "typical" individual 
with average income, average costs, etc.

## new data with quality = 0, ..., 5 and "typical" values
## for all other variables
nd <- data.frame(quality = 0:5, ski = "no", userfee = "no",
   income = mean(RecreationDemand$income),
   costC = mean(RecreationDemand$costC),
   costS = mean(RecreationDemand$costS),
   costH = mean(RecreationDemand$costH))

## predicted probability for non-zero trips
nonzero <- 1 - predict(m, newdata = nd, type = "prob")[, 1]
## visualize
plot(0:5, nonzero, type = "b")
## slopes
diff(nonzero)

So we see that if this "typical" individual had a quality rating of zero, 
the probability of taking any trips at all would only be 5%. While for the 
same "typical" individual with quality rating of 5, the probability of 
taking at least one trip would be 99%. Also we see that the influence is 
clearly non-linear and S-shaped. The slope varies between only 3.4 
percentage points to up to 32 percentage points!

Similarly, we can compute the predicted mean from the count component of 
the model (untruncated):

## predicted mean from (untruncated) count component
countmean <- predict(m, newdata = nd, type = "count")
plot(0:5, countmean, type = "b")

This is also not linear, but not as strongly as for the binary part. 
Finally, we can also combine both model parts to the overall response:

## predicted overall mean (combining both model parts)
allmean <- predict(m, newdata = nd, type = "response")
plot(0:5, allmean, type = "b")

which is still rather S-shaped.

These curves are sometimes also referred to as "effects" and I find them 
rather easy to interpret. For each of these curves you can also look at 
the "marginal effects", i.e., the average slope.

So, if we want to look at the marginal effect of "quality" in the binary 
hurdle part, we can use Equation 5.2 from our AER book or equivalently 
Equation 21-10 from Greene's book (5th ed).

## average regressor
x <- c(1, mean(RecreationDemand$quality), mean(RecreationDemand$income))
## coefficients
b <- coef(m, model = "zero")
## linear predictor x'b
xb <- sum(x * b)
## marginal effect
dlogis(xb) * b["quality"]

As the average quality is 1.42, it is not surprising that this slope is 
similar to the diff(nonzero)[2] (i.e., going from quality=1 to quality=2):
an increase of about 32 percentage points. But we have seen above that the 
slope at quality=4 would be very different from this!

With similar arguments you could compute average slopes for the count mean 
or the overall mean. I don't do it here because I find the _effects_ 
computed above much more useful than the _marginal effects_. However, as 
econometricians prefer numbers to graphics, you don't see many effects 
plots in econometric works. Also, Stata computes marginal effects for just 
about everything - no matter whether it is useful or not.

For standard lm and glm as well as multinom and polr objects, John Fox's
"effects" package can also produce very nice effect displays 
automatically. It does not work for hurdle objects, though.

> On pages 140 - 141 of your text you run a hurdle model for recreational 
> trips. In the count model the coefficient on QUALITY is 0.1717, and in 
> the zero hurdle model the coefficient on quality is 1.5029. Let us 
> pretend that QUALITY is the only independent variable. It is my 
> understanding that neither 0.1717 or 1.5029 are marginal effects, i.e., 
> we cannot say a one percent increase in quality would cause a 0.1717 
> increase in trips. Because the model is nonlinear, the marginal effects 
> will vary with the value of the covariates. Greene suggests evaluating 
> the expression at 1) the sample means of the data or 2) "evaluate the 
> marginal effects at every observation and use the sample average of the 
> individual marginal effects" (pg 668 in 5th edition).

Yes, but this is only for binary regressions. There it is rather clear 
that you want to track changes in the mean i.e. in the probability. 
However for hurdle models, there are various quantities you could be 
interested in: probability for non-zero count, untruncated count mean, 
truncated count mean, overall mean, etc. That's why there is no single 
marginal effect of interest.

Hope this clarifies some of the previous issues!
Best regards,
Achim



More information about the R-help mailing list