[R] reference category for factor in regression

Mon Jan 19 21:22:10 CET 2009

Hi all,

Thanks for the advice.

> See ?relevel for information on how to reorder the levels of a factor,
> while being able to specify the reference level.
> Basically, the first level of the factor is taken as the reference.

Yes, that is how I have always used it. But the problem is that in
this particular regression R does *not* take the first level as the
reference. In fact, AGE appears twice in the same regression (in two
different interactions); in one case R selects the 1st category as
the reference and in the other case a different one.
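
For completeness, a minimal sketch of how the first level can be
checked and changed with relevel(). The vector `age` below is a
made-up toy example, not the survey variable:

```r
# Toy factor (hypothetical data); levels default to sorted order
age <- factor(c("18-24", "65+", "25-34", "65+"))
levels(age)[1]                      # "18-24"

# Move "65+" to the front so it becomes the first level
age <- relevel(age, ref = "65+")
levels(age)[1]                      # "65+"
```

In my data "65+" is already the first level (see str(AGE) below), so
the ordering of the levels alone does not explain the behaviour.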

> BTW, you might want to review Frank Harrell's page on why categorizing a
> continuous variable is not a good idea:

I most certainly agree, but the categorisation was imposed in the
survey itself, so it is all the data I have. I did not design the
questions :-) ... Thanks for this reference, though, as it will
certainly be useful for my teaching.

> str(AGE)
 Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...

So I expect 65+ to be the reference category, but it is not.

Here is a little bit more R code to show the problem:

> str(AGE)
 Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
> table(LABOUR)
LABOUR
   0    1
 692 1409
> NONLABOUR <- 1 - LABOUR
> m <- glm(NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE : LABOUR + AGE : NONLABOUR, family=binomial)
> m

Call:  glm(formula = NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE:LABOUR +
  AGE:NONLABOUR, family = binomial)

Coefficients:
            LABOUR           NONLABOUR       LABOUR:AGE65+     LABOUR:AGE18-24
          -0.35110            -0.30486            -0.11890            -0.66444
   LABOUR:AGE25-34     LABOUR:AGE35-49     LABOUR:AGE50-64  NONLABOUR:AGE18-24
          -0.23893            -0.15860                  NA            -0.65655
NONLABOUR:AGE25-34  NONLABOUR:AGE35-49  NONLABOUR:AGE50-64
          -0.72815             0.04951             0.17481

As you can see, 65+ is taken as reference category in the interaction
with NONLABOUR, but not in the interaction with LABOUR.

I know glm(NOVOTE ~ LABOUR * AGE, family=binomial) would be a more
conventional specification, but the above should be equivalent and
should give me the coefficients and standard errors for the two groups
(LABOUR and NONLABOUR) separately, rather than for the difference /
interaction term.
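
One alternative that aims at the same per-group coefficients is to
code the group as a single factor and nest AGE within it. The sketch
below is only an illustration on simulated data: the data frame `d`,
the factor `GROUP`, and the random outcome are all assumptions made
for the example, not part of my survey.

```r
# Hypothetical toy data: every AGE x LABOUR cell occurs 20 times
set.seed(42)
ages <- c("65+", "18-24", "25-34", "35-49", "50-64")
d <- expand.grid(AGE = ages, LABOUR = 0:1, rep = 1:20,
                 stringsAsFactors = FALSE)
d$AGE    <- factor(d$AGE, levels = ages)       # "65+" is the first level
d$GROUP  <- factor(d$LABOUR, levels = c(1, 0),
                   labels = c("LABOUR", "NONLABOUR"))
d$NOVOTE <- rbinom(nrow(d), 1, 0.4)            # random outcome, illustration only

# One intercept per group, plus AGE contrasts within each group
m2 <- glm(NOVOTE ~ 0 + GROUP + GROUP:AGE, family = binomial, data = d)
names(coef(m2))
```

With GROUP as a factor, R should drop the first AGE level within each
group, so both sets of AGE coefficients are relative to 65+.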

Perhaps the NA in the above output (which I have only just noticed) is
a hint at the problem, but I am not sure why it occurs.
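
An NA coefficient in glm() output indicates a column that is linearly
dependent on the others. In the output above all five LABOUR:AGE
columns are present, and together they sum to the LABOUR column, so
one of them is redundant. A rough way to inspect this is to build the
design matrix directly and compare its number of columns with its
rank. The data frame `d` below is a hypothetical stand-in containing
each AGE x LABOUR combination once, not my survey data:

```r
# Hypothetical stand-in data: one row per AGE x LABOUR cell
ages <- c("65+", "18-24", "25-34", "35-49", "50-64")
d <- expand.grid(AGE = ages, LABOUR = 0:1, stringsAsFactors = FALSE)
d$AGE       <- factor(d$AGE, levels = ages)
d$NONLABOUR <- 1 - d$LABOUR

# Same right-hand side as in the model above
X <- model.matrix(~ 0 + LABOUR + NONLABOUR + AGE:LABOUR + AGE:NONLABOUR,
                  data = d)
colnames(X)                               # which dummy columns R generated
c(columns = ncol(X), rank = qr(X)$rank)   # more columns than rank => aliasing
```

There are only 10 distinct AGE x LABOUR cells, so the rank cannot
exceed 10; any column beyond that has to come out as NA.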

> table(m$model$AGE, m$model$LABOUR, m$model$NOVOTE)
, ,  = 0

          0   1
  65+   137  24
  18-24  68 127
  25-34  59 267
  35-49  71 298
  50-64  82 179

, ,  = 1

          0   1
  65+   101  15
  18-24  26  46
  25-34  21 148
  35-49  55 179
  50-64  72 126

Does anyone have any idea? There must be a reason why R decides *not*
to use 65+ as the reference in that particular scenario, and I am
missing what it is.

Jos