[R] Need advice about models with ordinal input variables
Paul Johnson
pauljohn at ku.edu
Tue Nov 8 20:41:32 CET 2005
Dear colleagues:
I've been storing up this question for a long time and apologize for the
length and verbosity of it. I am having trouble in consulting with
graduate students on their research projects. They are using surveys to
investigate the sources of voter behavior or attitudes. They have
predictors that are factors, some ordered, but I am never confident in
telling them what they ought to do. Usually, they come in with a
regression model fitted as though these variables are numerical, and if
one looks about in the social science literature, one finds that many
people have published doing the same.
I want to ask your advice about some cases.
1. An ordered factor that masquerades as a numerical "interval level" score.
In the research journals, these are the ones most often treated as
numerical variables in regressions. For example: "Thermometer scores"
for Presidential candidates range from 0 to 100 in integer units.
What's a better idea? In an OLS model with just one input variable, a
plot will reveal if there is a significant "nonlinearity". One can
recode the assigned values to linearize the final model or take the
given values and make a nonlinear model.
In the R package "acepack" I found avas, which works like a "rubber
ruler" and recodes variables in order to make relationships as linear
and homoskedastic as possible. I've never seen this used in the social
science literature. It works like magic. Take an ugly scatterplot and
shazam, out come transformed variables that have a beautiful plot. But
what do you make of these things? There is so much going on in these
transformations that interpretation of the results is very difficult.
You can't say "a one unit increase in x causes a b increase in y".
Furthermore, if the model is a survival model, a logistic regression, or
other non-OLS model, I don't see how the avas approach will help.
I've tried fiddling about with smoothers, treating the input scores as
if they were numerical. I got this idea from Prof. Harrell's Regression
Modeling Strategies. In his Design package for R, one can include a
cubic spline for a variable in a model by replacing x with rcs(x). Very
convenient. If the results say the relationship is mostly linear, then
we might as well treat the input variable as a numerical score and save
some degrees of freedom.
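
As a rough sketch of that check, the following compares a straight-line fit against a spline fit with an F test. It assumes simulated data and uses ns() from the base "splines" package as a stand-in for rcs() from Design; the variable names therm and y are hypothetical.

```r
library(splines)  # ships with R; ns() builds a natural cubic spline basis

set.seed(1)
# Hypothetical stand-in for a 0-100 thermometer score and an outcome
therm <- sample(0:100, 500, replace = TRUE)
y <- 2 + 0.03 * therm + rnorm(500)

fit_linear <- lm(y ~ therm)              # score treated as numerical
fit_spline <- lm(y ~ ns(therm, df = 4))  # smooth, possibly nonlinear, effect

# The linear model is nested in the spline model, so anova() gives an
# F test of the extra spline terms (3 additional degrees of freedom here)
nonlin_test <- anova(fit_linear, fit_spline)
print(nonlin_test)
```

A large p-value in the second row would be consistent with keeping the plain numerical score, though it does not prove linearity.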
But if the higher order terms are statistically significant, it is
difficult to know what to do. The best strategy I have found so far is
to calculate fitted values for particular inputs and then try to tell a
story about them.
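
A minimal sketch of the fitted-values strategy, again on simulated data with hypothetical names (therm, y), and again using ns() from the base "splines" package rather than rcs(). The chosen input scores 0, 25, 50, 75, 100 are only an illustration.

```r
library(splines)

set.seed(2)
therm <- sample(0:100, 500, replace = TRUE)
# Deliberately curved relationship so the spline terms matter
y <- 1 + 4 * sin(therm / 30) + rnorm(500, sd = 0.5)

fit <- lm(y ~ ns(therm, df = 4, Boundary.knots = c(0, 100)))

# Fitted values and confidence limits at a few substantively chosen scores
new_scores <- data.frame(therm = c(0, 25, 50, 75, 100))
pred <- predict(fit, newdata = new_scores, interval = "confidence")
print(cbind(new_scores, round(pred, 2)))
```

The resulting table (columns fit, lwr, upr) is what one would narrate, instead of the spline coefficients themselves.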
2. Ordinal variables with fewer than 10 values.
Consider variables like self-reported ideology, where respondents are
asked to place themselves on a 7 point scale ranging from "very
conservative" to "very liberal". Or Party Identification on a 7 point
scale, ranging (in the US) from "Strong Democrat" to "Strong Republican".
It has been quite common to see these thrown into regression models as
if they were numerical.
I've sometimes found it useful to run a regression treating them as
unordered factors, and then I attempt to glean a pattern in the
coefficients. If the parameter estimates step up by roughly equal
increments from one level to the next, then one might think there's no
damage from treating them as numerical variables.
Yesterday, it occurred to us that there should be a significance test to
determine if one loses predictive power by replacing the
factor-treatment of x with x itself. Is there a non-nested model test
that is most appropriate?
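
One way to frame this comparison, offered as a sketch and not a settled answer: the model with x as a numerical score is a special case of the model with x as a factor (the level effects are constrained to lie on a straight line), so the two are nested and an ordinary F test from anova() applies. The data and the names ideology and y below are simulated and hypothetical.

```r
set.seed(3)
# Hypothetical 7-point ideology scale and an outcome
ideology <- sample(1:7, 400, replace = TRUE)
y <- 0.5 * ideology + rnorm(400)

fit_numeric <- lm(y ~ ideology)          # 1 slope
fit_factor  <- lm(y ~ factor(ideology))  # 6 dummy coefficients

# F test of the 5 extra parameters the factor coding uses
cmp <- anova(fit_numeric, fit_factor)
print(cmp)
```

A small p-value in the second row would suggest that the equal-spacing constraint costs real predictive power.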
3. Truly numerical variables that are reported as "grouped" ordinal
scales. These variables are awful in many ways.
Income is often reported in a form like this:
1) Less than 20000
2) 20000 to 35000
3) 35001 to 50000
4) 50001 to 100000
5) above 100000
Education often appears in a form that has
1) 8 years or less
2) 9 years
3) 10 years
4) 11 years
5) 12 years
6) some college completed
7) undergraduate degree completed
8) graduate degree completed
These predictors pose many problems. We have dissimilar people grouped
together, so there are "errors in variables" and it seems obvious that
the scores should be recoded somehow to reflect the substance of the
differences among groups. But how?
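
One simple possibility, sketched here only to make the question concrete, is to replace each income code with the midpoint of its bracket. The value for the open-ended top bracket (150000) and for the bottom bracket (10000) are arbitrary assumptions, and the respondent codes are made up. This does nothing about the dissimilar people grouped within a bracket.

```r
# Hypothetical respondent codes on the 5-category income item
income_code <- c(1, 3, 5, 2, 4, 3)

# Approximate bracket midpoints in dollars; the first and last entries are
# assumptions, since those brackets have no natural midpoint
midpoints <- c(10000, 27500, 42500, 75000, 150000)

# Look up the dollar score for each respondent by position
income_dollars <- midpoints[income_code]
print(income_dollars)
```

The recoded variable can then be entered as a numerical predictor, and the sensitivity of the results to the assumed end values can be checked by trying alternatives.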
4. Ordered variables with a small number of scores.
For example, "has your economic situation been
1) worse
2) same
3) better"
or "how do you feel when you see the American flag?"
1) no effect
2) OK
3) great
4) ecstatic
Anyway, in an R model, I think the right thing to do is to enter them
into a regression with as.ordered(x).
But I don't know what to say about the results. Has anybody written an
"idiot's guide to orthogonal polynomials"? Aside from calculating fitted
values, how do you interpret these things? Is there ever a point when
you would say "we should treat that as a numerical variable with scores
1-2-3-4" rather than as an ordered factor?
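
To make the question concrete, here is a small sketch of what R does with an ordered factor, using simulated data and hypothetical names (flag, y). By default an ordered factor gets contr.poly() contrasts, so the coefficients are linear (.L), quadratic (.Q), and cubic (.C) trend terms across the levels, assuming equal spacing.

```r
set.seed(4)
levs <- c("no effect", "OK", "great", "ecstatic")
flag <- factor(sample(levs, 300, replace = TRUE), levels = levs, ordered = TRUE)
y <- as.integer(flag) + rnorm(300)

# Contrast matrix R uses for a 4-level ordered factor
print(contr.poly(4))

fit_ordered   <- lm(y ~ flag)
fit_unordered <- lm(y ~ factor(flag, ordered = FALSE))

# Coefficient names are (Intercept), flag.L, flag.Q, flag.C
print(coef(summary(fit_ordered)))

# Same fitted values either way; only the parameterization differs
same_fit <- all.equal(fitted(fit_ordered), fitted(fit_unordered))
print(same_fit)
```

If only the flag.L term is clearly different from zero, that would be one informal sign that scores 1-2-3-4 lose little relative to the ordered factor.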
If you have advice, I would be delighted to hear it.
--
Paul E. Johnson email: pauljohn at ku.edu
Dept. of Political Science http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas Office: (785) 864-9086
Lawrence, Kansas 66044-3177 FAX: (785) 864-5700