[R] Need advice about models with ordinal input variables

Tue Nov 8 20:41:32 CET 2005

Dear colleagues:

I've been storing up this question for a long time and apologize for the 
length and verbosity of it.  I am having trouble in consulting with 
graduate students on their research projects.  They are using surveys to 
investigate the sources of voter behavior or attitudes.  They have 
predictors that are factors, some ordered, but I am never confident in 
telling them what they ought to do.  Usually, they come in with a 
regression model fitted as though these variables are numerical, and if 
one looks about in the social science literature, one finds that many 
people have published doing the same.

I want to ask your advice about some cases.

1. An ordered factor that masquerades as a numerical "interval level" score.

In the research journals, these are the ones most often treated as 
numerical variables in regressions. For example: "Thermometer scores" 
for Presidential candidates range from 0 to 100 in integer units.

What's a better idea?  In an OLS model with just one input variable, a 
plot will reveal if there is a significant "nonlinearity". One can 
recode the assigned values to linearize the final model or take the 
given values and make a nonlinear model.

In the R package "acepack" I found avas, which works like a "rubber 
ruler" and recodes variables in order to make relationships as linear 
and homoskedastic as possible.  I've never seen this used in the social 
science literature.  It works like magic.  Take an ugly scatterplot and 
shazam, out come transformed variables that have a beautiful plot.  But 
what do you make of these things?  There is so much going on in these 
transformations that interpretation of the results is very difficult. 
You can't say "a one unit increase in x causes a b increase in y". 
Furthermore, if the model is a survival model, a logistic regression, or 
other non-OLS model, I don't see how the avas approach will help.

I've tried fiddling about with smoothers, treating the input scores as 
if they were numerical.  I got this idea from Prof. Harrell's Regression 
Modeling Strategies.  In his Design package for R, one can include a 
cubic spline for a variable in a model by replacing x with rcs(x). Very 
convenient. If the results say the relationship is mostly linear, then 
we might as well treat the input variable as a numerical score and save 
some degrees of freedom.
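
To make that check concrete, here is a minimal sketch. It is not the Design 
package's rcs(); I have swapped in ns() from the splines package that ships 
with R, and the data are simulated, so the variable names and the numbers 
are assumptions for illustration only.

```r
library(splines)  # ships with R; ns() builds a natural cubic spline basis

set.seed(1)
# Simulated (hypothetical) data: a 0-100 thermometer score with a curved effect
therm <- sample(0:100, 300, replace = TRUE)
y <- 2 + 0.03 * therm - 0.0004 * (therm - 50)^2 + rnorm(300, sd = 0.5)

fit_lin <- lm(y ~ therm)               # score treated as a numerical variable
fit_spl <- lm(y ~ ns(therm, df = 4))   # natural cubic spline with 4 df

# A straight line is a special case of the natural spline fit, so the two
# models are nested and an ordinary F test on the 3 extra terms applies.
cmp <- anova(fit_lin, fit_spl)
print(cmp)
```

A large p-value in the second row would be the "mostly linear" case; a 
small one says the spline terms are buying something.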

But if the higher order terms are statistically significant, it is 
difficult to know what to do. The best strategy I have found so far is 
to calculate fitted values for particular inputs and then try to tell a 
story about them.
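
A sketch of what I mean by fitted values for particular inputs, again 
with simulated data and with ns() from the splines package standing in 
for rcs(). The five input values are arbitrary choices of mine.

```r
library(splines)

set.seed(2)
# Simulated (hypothetical) thermometer data, as an illustration only
therm <- sample(0:100, 300, replace = TRUE)
y <- 2 + 0.03 * therm - 0.0004 * (therm - 50)^2 + rnorm(300, sd = 0.5)
fit <- lm(y ~ ns(therm, df = 4))

# Pick a few substantively meaningful scores and predict at each one
grid <- data.frame(therm = c(0, 25, 50, 75, 100))
pred <- predict(fit, newdata = grid, interval = "confidence")

# One row per chosen score: fitted value plus a 95% confidence band
out <- cbind(grid, round(pred, 2))
print(out)
```

The story is then told from the rows of this table (or a plot of them) 
rather than from the spline coefficients themselves.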

2. Ordinal variables with fewer than 10 values.

Consider variables like self-reported ideology, where respondents are 
asked to place themselves on a 7 point scale ranging from "very 
conservative" to "very liberal".  Or Party Identification on a 7 point 
scale, ranging (in the US) from "Strong Democrat" to "Strong Republican".

It has been quite common to see these thrown into regression models as 
if they were numerical.

I've sometimes found it useful to run a regression treating them as 
unordered factors, and then I attempt to glean a pattern in the 
coefficients.  If the parameter estimates step up by roughly equal increments, 
then one might think there's no damage from treating them as numerical 
variables.

Yesterday, it occurred to us that there should be a significance test to 
determine if one loses predictive power by replacing the 
factor-treatment of x with x itself.  Is there a non-nested model test 
that is most appropriate?
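
One possibility, sketched here with simulated data: as far as I can tell 
the numerical-score model is actually nested in the factor model (it 
forces the category effects to lie on a straight line), so an ordinary F 
test from anova() may be enough. The variable names and effect sizes 
below are made up.

```r
set.seed(3)
# Simulated (hypothetical) data: a 7-point scale whose effects are not
# evenly spaced (there is a jump between categories 4 and 5)
ideo <- sample(1:7, 500, replace = TRUE)
effect <- c(0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2)
y <- effect[ideo] + rnorm(500)

fit_num <- lm(y ~ ideo)          # one slope: equal steps between categories
fit_fac <- lm(y ~ factor(ideo))  # six dummies: unrestricted category means

# F test on the 5 parameters that the numerical coding constrains away
cmp <- anova(fit_num, fit_fac)
print(cmp)
```

A small p-value here would say the equal-steps restriction costs 
something; a large one would say the 1-to-7 scoring does no visible harm.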

3. Truly numerical variables that are reported as "grouped" ordinal 
scales. These variables are awful in many ways.

Income is often reported in a form like this:

1) Less than 20000
2) 20000 to 35000
3) 35001 to 50000
4) 50001 to 100000
5) above 100000

Education often appears in a form that has
1) 8 years or less
2) 9 years
3) 10 years
4) 11 years
5) 12 years
6) some college completed
7) undergraduate degree completed
8) graduate degree completed

These predictors pose many problems.  We have dissimilar people grouped 
together, so there are "errors in variables" and it seems obvious that 
the scores should be recoded somehow to reflect the substance of the 
differences among groups. But how?
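
One crude possibility is to score each bracket at its midpoint. The 
sketch below does that for the income categories; the midpoint rule is 
only an assumption, and the values for the open-ended bottom and top 
brackets (10000 and 150000) are arbitrary guesses of mine, not anything 
the survey provides.

```r
# Example responses coded 1-5 as in the income list above (made-up data)
income_code <- c(1, 3, 5, 2, 4, 3)

# Assumed dollar score for each bracket: approximate midpoints.
# The first and last entries are arbitrary because those brackets are
# open-ended (or bounded only by an assumed floor of zero).
midpoint <- c(10000, 27500, 42500, 75000, 150000)

# Look up the assumed dollar value for each respondent
income_dollars <- midpoint[income_code]
print(data.frame(income_code, income_dollars))
```

This at least puts the unequal bracket widths into the scores, but it 
does nothing about the heterogeneity within each bracket.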

4. Ordered variables with a small number of scores.

For example, "has your economic situation been
    1) worse
    2) same
    3) better"

or "how do you feel when you see the American flag?"
    1) no effect
    2) OK
    3) great
    4) ecstatic

Anyway, in an R model, I think the right thing to do is to enter them 
into a regression with as.ordered(x).

But I don't know what to say about the results.  Has anybody written an 
"idiot's guide to orthogonal polynomials"?  Aside from calculating fitted 
values, how do you interpret these things?  Is there ever a point when 
you would say "we should treat that as a numerical variable with scores 
1-2-3-4" rather than as an ordered factor?
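
For what it is worth, here is a small sketch of what as.ordered(x) puts 
into the model for a 4-category variable. The data are simulated and the 
names are hypothetical. With R's default contrast settings, an ordered 
factor gets polynomial contrasts: the .L column is proportional to 
(-3, -1, 1, 3), .Q to (1, -1, -1, 1), and .C to (-1, 3, -3, 1).

```r
set.seed(4)
# Simulated (hypothetical) data: 4 ordered categories scored 1-2-3-4
x <- sample(1:4, 400, replace = TRUE)
y <- 0.5 * x + rnorm(400)

xo <- as.ordered(x)
P <- contrasts(xo)       # polynomial contrasts: columns .L, .Q, .C
print(round(P, 3))

fit_ord <- lm(y ~ xo)         # coefficients named xo.L, xo.Q, xo.C
fit_fac <- lm(y ~ factor(x))  # dummy coding of the same 4 categories
fit_num <- lm(y ~ x)          # scores 1-2-3-4 as a numerical variable

# Same fitted values as the dummy-coded model; only the parameterization differs
same_fit <- isTRUE(all.equal(unname(fitted(fit_ord)), unname(fitted(fit_fac))))
print(same_fit)

print(coef(summary(fit_ord)))

# Dropping .Q and .C leaves a straight line in the equally spaced scores,
# so this F test asks whether the 1-2-3-4 scoring is adequate.
cmp <- anova(fit_num, fit_ord)
print(cmp)
```

Read this way, the .Q and .C rows of the summary are tests of departure 
from equally spaced linear effects, and non-significant .Q and .C terms 
would be one argument for the simpler 1-2-3-4 scoring.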

If you have advice, I would be delighted to hear it.

-- 
Paul E. Johnson                       email: pauljohn at ku.edu
Dept. of Political Science            http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas                  Office: (785) 864-9086
Lawrence, Kansas 66044-3177           FAX: (785) 864-5700