[R] reference category for factor in regression
Jos Elkink
jos.elkink at ucd.ie
Mon Jan 19 21:22:10 CET 2009
Hi all,
Thanks for the advice.
> See ?relevel for information on how to reorder the levels of a factor,
> while being able to specify the reference level.
> Basically, the first level of the factor is taken as the reference.
Yes, that is how I always used it. But the problem is, in this
particular regression R does *not* take the first level as reference.
In fact, AGE appears twice in the same regression (two different
interactions) and in one case it selects the 1st category and in
another case a different one.
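
For anyone following along, the advice quoted above can be sketched on a toy factor. The vector `age` below is made up for illustration and is not the survey variable:

```r
# Toy factor; by default the levels are sorted, so "18-24" comes first
age <- factor(c("18-24", "65+", "25-34", "65+"))
levels(age)

# relevel() moves the chosen level to the front, making it the reference
# level under the default treatment contrasts
age2 <- relevel(age, ref = "65+")
levels(age2)
contrasts(age2)   # the reference level is the all-zero row
```

This only controls which level comes first; it does not by itself decide how a term is coded in a formula without an intercept.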
> BTW, you might want to review Frank Harrell's page on why categorizing a
> continuous variable is not a good idea:
I most certainly agree, but the categorisation has been imposed in the
survey itself, so it is all the data I have. I did not design the
questions :-) ... Thanks for this reference, though, as it is
certainly interesting to inform my teaching.
> str(AGE)
Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
So I expect 65+ to be the reference category, but it is not.
Here is a little bit more R code to show the problem:
> str(AGE)
Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
> table(LABOUR)
LABOUR
0 1
692 1409
> NONLABOUR <- 1 - LABOUR
> m <- glm(NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE : LABOUR + AGE : NONLABOUR, family=binomial)
> m
Call: glm(formula = NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE:LABOUR +
AGE:NONLABOUR, family = binomial)
Coefficients:
            LABOUR           NONLABOUR       LABOUR:AGE65+     LABOUR:AGE18-24
          -0.35110            -0.30486            -0.11890            -0.66444
   LABOUR:AGE25-34     LABOUR:AGE35-49     LABOUR:AGE50-64  NONLABOUR:AGE18-24
          -0.23893            -0.15860                  NA            -0.65655
NONLABOUR:AGE25-34  NONLABOUR:AGE35-49  NONLABOUR:AGE50-64
          -0.72815             0.04951             0.17481
As you can see, 65+ is taken as reference category in the interaction
with NONLABOUR, but not in the interaction with LABOUR.
I know glm(NOVOTE ~ LABOUR * AGE, family=binomial) would be a more
conventional specification, but the above should be equivalent and
should give me the coefficients and standard errors for the two groups
(LABOUR and NONLABOUR) separately, rather than for the difference
(the interaction term).
Perhaps the NA in the above output (which I only noticed now) is a hint
at the problem, but I am not sure why it occurs.
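
One possible reading of the NA, going by the coefficient names in the output: the LABOUR:AGE term appears to have been expanded into indicator columns for all five age groups, and those five columns add up to the LABOUR column itself, so one of them cannot be estimated. A minimal sketch to check this on a toy design follows. The data frame `d` is an assumption for illustration (each age group crossed with LABOUR = 0/1), not the survey data; only the formula is taken from the post.

```r
# Toy design: every AGE level crossed with LABOUR = 0/1 (illustrative only)
age_levels <- c("65+", "18-24", "25-34", "35-49", "50-64")
d <- expand.grid(AGE = factor(age_levels, levels = age_levels),
                 LABOUR = c(0, 1))
d$NONLABOUR <- 1 - d$LABOUR

# Design matrix for the right-hand side used in the post
X <- model.matrix(~ 0 + LABOUR + NONLABOUR + AGE:LABOUR + AGE:NONLABOUR,
                  data = d)
colnames(X)

# Columns that belong to the LABOUR:AGE term
lab_age <- grep("^LABOUR:AGE", colnames(X))
length(lab_age)

# Is the LABOUR column exactly the sum of the LABOUR:AGE columns?
all(X[, "LABOUR"] == rowSums(X[, lab_age, drop = FALSE]))

# Compare the rank of the design with its number of columns
c(rank = qr(X)$rank, columns = ncol(X))
```

If the sum check is TRUE, the LABOUR:AGE columns are collinear with LABOUR, and glm() reports NA for whichever column it drops.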
> table(m$model$AGE, m$model$LABOUR, m$model$NOVOTE)
, , = 0

          0   1
  65+   137  24
  18-24  68 127
  25-34  59 267
  35-49  71 298
  50-64  82 179

, , = 1

          0   1
  65+   101  15
  18-24  26  46
  25-34  21 148
  35-49  55 179
  50-64  72 126
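
The cell counts in the table above are enough to refit both specifications, which gives a way to check the claimed equivalence. This is a sketch under assumptions: the names `cells`, `voted`, `novote`, `m1` and `m2` are introduced here for illustration, and the model is fitted to grouped counts with cbind(novote, voted) rather than to the individual 0/1 responses used in the post.

```r
# Cell counts copied from the table() output (rows: AGE, columns: LABOUR)
age_levels <- c("65+", "18-24", "25-34", "35-49", "50-64")
cells <- data.frame(
  AGE    = factor(rep(age_levels, times = 2), levels = age_levels),
  LABOUR = rep(c(0, 1), each = 5),
  voted  = c(137, 68, 59, 71, 82,  24, 127, 267, 298, 179),  # NOVOTE = 0
  novote = c(101, 26, 21, 55, 72,  15,  46, 148, 179, 126)   # NOVOTE = 1
)
cells$NONLABOUR <- 1 - cells$LABOUR

# Specification from the post, fitted to the grouped counts
m1 <- glm(cbind(novote, voted) ~ 0 + LABOUR + NONLABOUR +
            AGE:LABOUR + AGE:NONLABOUR,
          family = binomial, data = cells)

# Conventional specification
m2 <- glm(cbind(novote, voted) ~ LABOUR * AGE,
          family = binomial, data = cells)

coef(m1)
names(which(is.na(coef(m1))))       # which column(s) were dropped
all.equal(fitted(m1), fitted(m2))   # same fitted proportions?
```

If the fitted values agree, the two formulas describe the same model and differ only in how the coefficients are parametrised, including which age group ends up as the baseline within each LABOUR group.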
Does anyone have an idea? There must be a reason R decides *not* to use
65+ as reference in that particular scenario, but I am missing it.
Jos