[Rd] inconsistent handling of factor, character, and logical predictors in lm()
Fox, John
j|ox @end|ng |rom mcm@@ter@c@
Sat Aug 31 17:54:24 CEST 2019
Dear Abby,
> On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdle.a using gmail.com> wrote:
>
>> I think that it would be better to handle factors, character predictors, and logical predictors consistently.
>
> "logical predictors" can be regarded as categorical or continuous (i.e. 0 or 1).
> And the model matrix should be the same, either way.
I think that you're mistaking a coincidence for a principle. The coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE. Functions like lm() treat logical predictors as factors, *not* as numerical variables.
That one would get the same coefficient in either case is a consequence of the coincidence and the fact that the default contrasts for unordered factors are contr.treatment(). For example, if you changed the contrasts option, you'd get a different estimate (though of course a model with the same fit to the data and an equivalent interpretation):
------------ snip --------------
> options(contrasts=c("contr.sum", "contr.poly"))
> m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> m3
Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
data = iris)
Coefficients:
(Intercept) Sepal.Width I(Species == "setosa")1
2.6672 0.9418 0.8898
> head(model.matrix(m3))
(Intercept) Sepal.Width I(Species == "setosa")1
1 1 3.5 -1
2 1 3.0 -1
3 1 3.2 -1
4 1 3.1 -1
5 1 3.6 -1
6 1 3.9 -1
> tail(model.matrix(m3))
(Intercept) Sepal.Width I(Species == "setosa")1
145 1 3.3 1
146 1 3.0 1
147 1 2.5 1
148 1 3.0 1
149 1 3.4 1
150 1 3.0 1
> lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"), data=iris)
Call:
lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species ==
"setosa"), data = iris)
Coefficients:
(Intercept) Sepal.Width as.numeric(Species == "setosa")
3.5571 0.9418 -1.7797
> -2*coef(m3)[3]
I(Species == "setosa")1
-1.779657
------------ snip --------------
>
> I think the first question to be asked is, which is the best approach,
> categorical or continuous?
> The continuous approach seems simpler and more efficient to me, but
> output from the categorical approach may be more intuitive, for some
> people.
I think that this misses the point I was trying to make: lm() et al. treat logical variables as factors, not as numerical predictors. One could argue about what's the better approach but not about what lm() does. BTW, I prefer treating a logical predictor as a factor because the predictor is essentially categorical.
>
> I note that the use factors and characters, doesn't necessarily
> produce consistent output, for $xlevels.
> (Because factors can have their levels re-ordered).
Again, this misses the point: Both factors and character predictors produce elements in $xlevels; logical predictors do not, even though they are treated in the model as factors. That factors have levels that aren't necessarily ordered alphabetically is a reason that I prefer using factors to using character predictors, but this has nothing to do with the point I was trying to make about $xlevels.
Best,
John
-------------------------------------------------
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: http::/socserv.mcmaster.ca/jfox
More information about the R-devel
mailing list