[R] reference category for factor in regression

Mon Jan 19 22:02:42 CET 2009

Hi Jos,

Does it help to recode AGE explicitly, listing the levels with the reference category first?

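# Default: levels are sorted, so "18-24" becomes the first level
AGE <- factor(c("65+","18-24","18-24","25-34"))
str(AGE)
# Explicit levels: "65+" is now the first level
AGE <- factor(c("65+","18-24","18-24","25-34"), levels=c("65+","18-24","25-34"))
str(AGE)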
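In case it is useful, here is a small sketch of how one could check which level R will treat as the reference for a single factor, using the relevel() route mentioned in the quoted advice. The toy vector is the same as above; nothing here uses your real data.

```r
# Toy factor as above; by default the levels are sorted, so "18-24" is first
AGE <- factor(c("65+", "18-24", "18-24", "25-34"))
levels(AGE)

# relevel() moves the chosen level to the front without retyping all levels
AGE <- relevel(AGE, ref = "65+")
levels(AGE)

# With the default treatment contrasts, the reference level is the row
# of zeros in the contrast matrix
contrasts(AGE)
```

If the all-zero row of contrasts(AGE) is not the level you expect, the factor itself is the problem; if it is, the cause is more likely in the model formula.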

Best,
Stephan

Jos Elkink wrote:
> Hi all,
> 
> Thanks for the advice.
> 
>> See ?relevel for information on how to reorder the levels of a factor,
>> while being able to specify the reference level.
>> Basically, the first level of the factor is taken as the reference.
> 
> Yes, that is how I always used it. But the problem is, in this
> particular regression R does *not* take the first level as reference.
> In fact, AGE appears twice in the same regression (two different
> interactions) and in one case it selects the 1st category and in
> another case a different one.
> 
>> BTW, you might want to review Frank Harrell's page on why categorizing a
>> continuous variable is not a good idea:
> 
> I most certainly agree, but the categorisation has been imposed in the
> survey itself, so it is all the data I have. I did not design the
> questions :-) ... Thanks for this reference, though, as it is
> certainly interesting to inform my teaching.
> 
>> str(AGE)
>  Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
> 
> So I expect 65+ to be the reference category, but it is not.
> 
> Here is a little bit more R code to show the problem:
> 
>> str(AGE)
>  Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
>> table(LABOUR)
> LABOUR
>    0    1
>  692 1409
>> NONLABOUR <- 1 - LABOUR
>> m <- glm(NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE : LABOUR + AGE : NONLABOUR, family=binomial)
>> m
> 
> Call:  glm(formula = NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE:LABOUR +
>   AGE:NONLABOUR, family = binomial)
> 
> Coefficients:
>             LABOUR           NONLABOUR       LABOUR:AGE65+     LABOUR:AGE18-24
>           -0.35110            -0.30486            -0.11890            -0.66444
>    LABOUR:AGE25-34     LABOUR:AGE35-49     LABOUR:AGE50-64  NONLABOUR:AGE18-24
>           -0.23893            -0.15860                  NA            -0.65655
> NONLABOUR:AGE25-34  NONLABOUR:AGE35-49  NONLABOUR:AGE50-64
>           -0.72815             0.04951             0.17481
> 
> As you can see, 65+ is taken as reference category in the interaction
> with NONLABOUR, but not in the interaction with LABOUR.
> 
> I know glm(NOVOTE ~ LABOUR * AGE, family=binomial) would be a more
> conventional specification, but the above should be equivalent and
> should give me the coefficients and standard errors for the two groups
> (LABOUR and NONLABOUR) separately, rather than for the difference /
> interaction term.
> 
> Perhaps the NA in the above output (which I only noticed now) is a hint
> at the problem, but I am not sure why that occurs.
> 
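About the NA: the coefficient table you posted lists all five AGE levels in the LABOUR interaction but only four in the NONLABOUR one. Since the five LABOUR:AGE dummies add up to LABOUR itself, one of them is aliased, and glm() reports NA for it. My understanding (an assumption on my part, I have not checked the sources) is that in a formula without intercept R codes the first factor it meets by full dummy variables rather than by contrasts. A minimal sketch with made-up data (three AGE levels only, variable names mirroring yours) to inspect the design matrix:

```r
# Made-up balanced data: every AGE x LABOUR cell occurs four times
d <- expand.grid(AGE = c("65+", "18-24", "25-34"), LABOUR = c(0, 1),
                 rep = 1:4, stringsAsFactors = FALSE)
d$AGE <- factor(d$AGE, levels = c("65+", "18-24", "25-34"))
d$NONLABOUR <- 1 - d$LABOUR

# Same right-hand side as in the glm() call
X <- model.matrix(~ 0 + LABOUR + NONLABOUR + AGE:LABOUR + AGE:NONLABOUR,
                  data = d)

colnames(X)   # which AGE levels appear in each interaction?
ncol(X)       # number of columns R created
qr(X)$rank    # number of linearly independent columns
```

If ncol(X) is larger than the rank, the surplus column is the one that comes out as NA in the fit.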
>> table(m$model$AGE, m$model$LABOUR, m$model$NOVOTE)
> , ,  = 0
> 
> 
>           0   1
>   65+   137  24
>   18-24  68 127
>   25-34  59 267
>   35-49  71 298
>   50-64  82 179
> 
> , ,  = 1
> 
> 
>           0   1
>   65+   101  15
>   18-24  26  46
>   25-34  21 148
>   35-49  55 179
>   50-64  72 126
> 
> Does anyone have any idea? There must be a reason R decides *not* to use
> 65+ as the reference in that particular scenario, and I am missing why.
> 
> Jos
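One way to get the group-specific coefficients with 65+ as the reference in both groups might be to put the two groups into a single factor and nest AGE within it. This is only a sketch on made-up data: GROUP is a name I invented here, NOVOTE is simulated, and I use three AGE levels to keep it short.

```r
set.seed(1)
# Made-up data with the same structure as yours (three AGE levels only)
d <- expand.grid(AGE = c("65+", "18-24", "25-34"), LABOUR = c(0, 1),
                 rep = 1:20, stringsAsFactors = FALSE)
d$AGE <- factor(d$AGE, levels = c("65+", "18-24", "25-34"))
# GROUP is a hypothetical helper factor replacing LABOUR / NONLABOUR
d$GROUP <- factor(ifelse(d$LABOUR == 1, "LABOUR", "NONLABOUR"))
d$NOVOTE <- rbinom(nrow(d), 1, 0.4)   # simulated outcome, no real meaning

# One intercept per group, plus AGE effects within each group
m <- glm(NOVOTE ~ 0 + GROUP + GROUP:AGE, family = binomial, data = d)
names(coef(m))
```

Each GROUP:AGE coefficient is then the difference to 65+ within that group, on the logit scale, and the two GROUP coefficients are the 65+ baselines.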