# [Rd] inconsistent handling of factor, character, and logical predictors in lm()

Fox, John jfox at mcmaster.ca
Sat Aug 31 17:54:24 CEST 2019

Dear Abby,

> On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdle.a using gmail.com> wrote:
>
>> I think that it would be better to handle factors, character predictors, and logical predictors consistently.
>
> "logical predictors" can be regarded as categorical or continuous (i.e. 0 or 1).
> And the model matrix should be the same, either way.

I think that you're mistaking a coincidence for a principle. The coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE. Functions like lm() treat logical predictors as factors, *not* as numerical variables.
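You can see this directly from how the model matrix is built. A minimal sketch (base R only; the toy data frame here is invented for illustration, and the column names assume R's default treatment contrasts):

```r
# A logical predictor enters the model matrix as a single dummy column
# named xTRUE -- exactly as a two-level factor would under the default
# treatment contrasts -- not as the numeric values 0/1.
d <- data.frame(y = c(1, 3, 2, 5), x = c(FALSE, TRUE, FALSE, TRUE))
colnames(model.matrix(lm(y ~ x, data = d)))
# "(Intercept)" "xTRUE"

# Coercing explicitly to numeric gives a numeric predictor instead:
colnames(model.matrix(lm(y ~ as.numeric(x), data = d)))
# "(Intercept)" "as.numeric(x)"
```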

That one would get the same coefficient in either case is a consequence of the coincidence and the fact that the default contrasts for unordered factors are contr.treatment(). For example, if you changed the contrasts option, you'd get a different estimate (though of course a model with the same fit to the data and an equivalent interpretation):

------------ snip --------------

```r
> options(contrasts = c("contr.sum", "contr.poly"))
> m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> m3

Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
    data = iris)

Coefficients:
            (Intercept)              Sepal.Width  I(Species == "setosa")1
                 2.6672                   0.9418                   0.8898

> head(model.matrix(m3))
  (Intercept) Sepal.Width I(Species == "setosa")1
1           1         3.5                      -1
2           1         3.0                      -1
3           1         3.2                      -1
4           1         3.1                      -1
5           1         3.6                      -1
6           1         3.9                      -1
> tail(model.matrix(m3))
    (Intercept) Sepal.Width I(Species == "setosa")1
145           1         3.3                       1
146           1         3.0                       1
147           1         2.5                       1
148           1         3.0                       1
149           1         3.4                       1
150           1         3.0                       1

> lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"), data=iris)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species ==
    "setosa"), data = iris)

Coefficients:
                    (Intercept)                      Sepal.Width
                         3.5571                           0.9418
as.numeric(Species == "setosa")
                        -1.7797

> -2*coef(m3)[3]
I(Species == "setosa")1
              -1.779657
```

------------ snip --------------

>
> I think the first question to be asked is, which is the best approach,
> categorical or continuous?
> The continuous approach seems simpler and more efficient to me, but
> output from the categorical approach may be more intuitive, for some
> people.

I think that this misses the point I was trying to make: lm() et al. treat logical variables as factors, not as numerical predictors. One could argue about what's the better approach but not about what lm() does. BTW, I prefer treating a logical predictor as a factor because the predictor is essentially categorical.

>
> I note that the use of factors and characters doesn't necessarily
> produce consistent output for `$xlevels`.
> (Because factors can have their levels re-ordered.)

Again, this misses the point: both factors and character predictors produce elements in `$xlevels`; logical predictors do not, even though they are treated in the model as factors. That factor levels aren't necessarily ordered alphabetically is one reason I prefer factors to character predictors, but it has nothing to do with the point I was trying to make about `$xlevels`.
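To make the asymmetry concrete, a minimal sketch (base R; the toy data are invented for illustration, and the behaviour shown for the logical predictor reflects R as of this thread):

```r
# factor and character predictors each leave an entry in $xlevels;
# a logical predictor leaves none, even though it is modelled as a factor.
d <- data.frame(y  = c(1, 2, 3, 4),
                f  = factor(c("a", "b", "a", "b")),
                ch = c("u", "v", "u", "v"),
                lg = c(TRUE, FALSE, TRUE, FALSE),
                stringsAsFactors = FALSE)  # keep ch as character
lm(y ~ f,  data = d)$xlevels   # $f: "a" "b"
lm(y ~ ch, data = d)$xlevels   # $ch: "u" "v"
lm(y ~ lg, data = d)$xlevels   # empty -- no record of the logical's levels
```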

Best,
John

-------------------------------------------------
John Fox, Professor Emeritus
McMaster University