[R] reference category for factor in regression
Stephan Kolassa
Stephan.Kolassa at gmx.de
Mon Jan 19 22:02:42 CET 2009
Hi Jos,
does explicitly recoding AGE help?
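# Default: levels are sorted, so "18-24" comes first and is the reference
AGE <- factor(c("65+","18-24","18-24","25-34"))
str(AGE)
# Explicit levels: "65+" is listed first and becomes the reference
AGE <- factor(c("65+","18-24","18-24","25-34"), levels=c("65+","18-24","25-34"))
str(AGE)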
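The recoding above spells out every level by hand. As a minimal sketch of the alternative mentioned in the quoted message, relevel() moves a single level to the front and keeps the others in their existing order. The data vector here is illustrative only; just the level names are taken from the thread.

```r
# Illustrative data only; the level names follow the thread.
AGE <- factor(c("65+", "18-24", "18-24", "25-34", "35-49", "50-64"))
levels(AGE)   # default order is sorted, so "18-24" comes first

# relevel() moves one level to the front and leaves the rest in order.
AGE <- relevel(AGE, ref = "65+")
levels(AGE)   # "65+" is now the first level
```

Either approach should leave "65+" as the first level, which is what treatment contrasts use as the baseline.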
Best,
Stephan
Jos Elkink wrote:
> Hi all,
>
> Thanks for the advice.
>
>> See ?relevel for information on how to reorder the levels of a factor,
>> while being able to specify the reference level.
>> Basically, the first level of the factor is taken as the reference.
>
> Yes, that is how I always used it. But the problem is, in this
> particular regression R does *not* take the first level as reference.
> In fact, AGE appears twice in the same regression (two different
> interactions) and in one case it selects the 1st category and in
> another case a different one.
>
>> BTW, you might want to review Frank Harrell's page on why categorizing a
>> continuous variable is not a good idea:
>
> I most certainly agree, but the categorisation has been imposed in the
> survey itself, so it is all the data I have. I did not design the
> questions :-) ... Thanks for this reference, though, as it is
> certainly interesting to inform my teaching.
>
>> str(AGE)
> Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
>
> So I expect 65+ to be the reference category, but it is not.
>
> Here is a little bit more R code to show the problem:
>
>> str(AGE)
> Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
>> table(LABOUR)
> LABOUR
> 0 1
> 692 1409
>> NONLABOUR <- 1 - LABOUR
>> m <- glm(NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE : LABOUR + AGE : NONLABOUR, family=binomial)
>> m
>
> Call: glm(formula = NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE:LABOUR +
> AGE:NONLABOUR, family = binomial)
>
> Coefficients:
>             LABOUR           NONLABOUR       LABOUR:AGE65+     LABOUR:AGE18-24
>           -0.35110            -0.30486            -0.11890            -0.66444
>    LABOUR:AGE25-34     LABOUR:AGE35-49     LABOUR:AGE50-64  NONLABOUR:AGE18-24
>           -0.23893            -0.15860                  NA            -0.65655
> NONLABOUR:AGE25-34  NONLABOUR:AGE35-49  NONLABOUR:AGE50-64
>           -0.72815             0.04951             0.17481
>
> As you can see, 65+ is taken as reference category in the interaction
> with NONLABOUR, but not in the interaction with LABOUR.
>
> I know glm(NOVOTE ~ LABOUR * AGE, family=binomial) would be a more
> conventional specification, but the above should be equivalent and
> should give me the coefficients and standard errors for the two groups
> (LABOUR and NONLABOUR) separately, rather than for the difference /
> interaction term.
>
> Perhaps the NA in the above output (which I only notice now) is a hint
> at the problem, but I am not sure why that occurs.
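A possible reading of the output above, offered as an assumption rather than a confirmed diagnosis: "An Introduction to R" notes that when the intercept is omitted, the first factor term in the model is encoded with indicator columns for all levels. That would give LABOUR:AGE five columns that sum to LABOUR itself, so one of them is aliased and reported as NA. The sketch below uses simulated data (not the survey data) and a hypothetical helper factor GROUP to show one specification that should yield per-group coefficients with 65+ as the baseline in both groups.

```r
set.seed(1)  # simulated data, for illustration only

n <- 2000
age_levels <- c("65+", "18-24", "25-34", "35-49", "50-64")
AGE <- factor(sample(age_levels, n, replace = TRUE), levels = age_levels)
LABOUR <- rbinom(n, 1, 0.65)
NOVOTE <- rbinom(n, 1, 0.4)

# GROUP is a hypothetical helper: the 0/1 indicator recoded as a factor.
GROUP <- factor(LABOUR, levels = c(0, 1), labels = c("NONLABOUR", "LABOUR"))

# One intercept per group, plus AGE contrasts nested within each group.
m2 <- glm(NOVOTE ~ 0 + GROUP + GROUP:AGE, family = binomial)
coef(m2)
```

With GROUP as a factor, each group gets its own baseline (the 65+ respondents) and four AGE contrasts, so no column should be aliased.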
>
>> table(m$model$AGE, m$model$LABOUR, m$model$NOVOTE)
> , ,  = 0
>
>           0   1
>   65+   137  24
>   18-24  68 127
>   25-34  59 267
>   35-49  71 298
>   50-64  82 179
>
> , ,  = 1
>
>           0   1
>   65+   101  15
>   18-24  26  46
>   25-34  21 148
>   35-49  55 179
>   50-64  72 126
>
> Does anyone have any idea? There must be a reason R decides *not* to
> use 65+ as the reference in that particular scenario, and I am missing why.
>
> Jos