[Rd] inconsistent handling of factor, character, and logical predictors in lm()

Sat Aug 31 17:54:24 CEST 2019

Dear Abby,

> On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdle.a using gmail.com> wrote:
> 
>> I think that it would be better to handle factors, character predictors, and logical predictors consistently.
> 
> "logical predictors" can be regarded as categorical or continuous (i.e. 0 or 1).
> And the model matrix should be the same, either way.

I think that you're mistaking a coincidence for a principle. The coincidence is that FALSE/TRUE coerce to 0/1 and sort in the order FALSE, TRUE. Functions like lm() treat logical predictors as factors, *not* as numerical variables.
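
One way to see this for yourself is to look at the model matrix rather than at the coefficients. The following is only a minimal sketch on made-up data (the data frame d and the variable names lgl and num are my own, chosen for illustration); it checks whether model.matrix() records a contrasts entry for each predictor, which it does for factor-like predictors but not for numeric ones:

```r
# Toy data, invented purely for illustration
d <- data.frame(y   = c(1.2, 2.3, 1.9, 3.1, 2.8, 3.6),
                lgl = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE))
d$num <- as.numeric(d$lgl)   # the same information coded as 0/1

mm_lgl <- model.matrix(y ~ lgl, data = d)
mm_num <- model.matrix(y ~ num, data = d)

# The logical predictor gets a "contrasts" entry, as a factor would
names(attr(mm_lgl, "contrasts"))

# The numeric 0/1 predictor gets none
attr(mm_num, "contrasts")

# Column names also differ (the exact suffix depends on the contrasts in use)
colnames(mm_lgl)
colnames(mm_num)
```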

That one gets the same coefficient in either case is a consequence of this coincidence together with the fact that the default contrasts for unordered factors are contr.treatment(). If you change the contrasts option, for example, you get a different estimate (though, of course, a model with the same fit to the data and an equivalent interpretation):

------------ snip --------------

> options(contrasts=c("contr.sum", "contr.poly"))
> m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> m3

Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"), 
    data = iris)

Coefficients:
            (Intercept)              Sepal.Width  I(Species == "setosa")1  
                 2.6672                   0.9418                   0.8898  

> head(model.matrix(m3))
  (Intercept) Sepal.Width I(Species == "setosa")1
1           1         3.5                      -1
2           1         3.0                      -1
3           1         3.2                      -1
4           1         3.1                      -1
5           1         3.6                      -1
6           1         3.9                      -1
> tail(model.matrix(m3))
    (Intercept) Sepal.Width I(Species == "setosa")1
145           1         3.3                       1
146           1         3.0                       1
147           1         2.5                       1
148           1         3.0                       1
149           1         3.4                       1
150           1         3.0                       1

> lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"), data=iris)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species == 
    "setosa"), data = iris)

Coefficients:
                    (Intercept)                      Sepal.Width  as.numeric(Species == "setosa")  
                         3.5571                           0.9418                          -1.7797  

> -2*coef(m3)[3]
I(Species == "setosa")1 
              -1.779657 

------------ snip --------------
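
For comparison, here is a short sketch of the coincidence itself: with the contrasts option set explicitly to the default contr.treatment(), the logical and the 0/1 numeric codings of the same predictor give identical coefficients. The object names m_lgl and m_num are mine, and the option is restored afterwards so the sketch doesn't disturb the session:

```r
# Set the default contrasts explicitly, remembering the previous setting
old <- options(contrasts = c("contr.treatment", "contr.poly"))

m_lgl <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data = iris)
m_num <- lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"),
            data = iris)

# Coefficient names differ, so compare the values only
same_coefs <- all.equal(unname(coef(m_lgl)), unname(coef(m_num)))
same_coefs

# Restore whatever contrasts option was in effect before
options(old)
```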

> 
> I think the first question to be asked is, which is the best approach, 
> categorical or continuous?
> The continuous approach seems simpler and more efficient to me, but
> output from the categorical approach may be more intuitive, for some
> people.

I think that this misses the point I was trying to make: lm() et al. treat logical variables as factors, not as numerical predictors. One could argue about what's the better approach but not about what lm() does. BTW, I prefer treating a logical predictor as a factor because the predictor is essentially categorical.

> 
> I note that the use factors and characters, doesn't necessarily
> produce consistent output, for $xlevels.
> (Because factors can have their levels re-ordered).

Again, this misses the point: both factors and character predictors produce elements in $xlevels; logical predictors do not, even though they are treated in the model as factors. That factors can have levels that aren't ordered alphabetically is a reason that I prefer using factors to using character predictors, but this has nothing to do with the point I was trying to make about $xlevels.
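
A small sketch for checking the $xlevels behaviour in your own version of R follows. The data frame d and the variable names f, ch, and lgl are made up for illustration, and I show no output for the logical case because what is stored there is exactly what is under discussion:

```r
# Three codings of categorical information from the iris data
d <- iris
d$f   <- d$Species                 # factor
d$ch  <- as.character(d$Species)   # character
d$lgl <- d$Species == "setosa"     # logical

m_f   <- lm(Sepal.Length ~ f,   data = d)
m_ch  <- lm(Sepal.Length ~ ch,  data = d)
m_lgl <- lm(Sepal.Length ~ lgl, data = d)

m_f$xlevels     # levels recorded for the factor
m_ch$xlevels    # levels recorded for the character predictor
m_lgl$xlevels   # inspect what, if anything, is recorded for the logical

# The logical is nonetheless fit as a two-level factor
# (coefficient name under the default contr.treatment contrasts)
names(coef(m_lgl))
```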

Best,
 John

  -------------------------------------------------
  John Fox, Professor Emeritus
  McMaster University
  Hamilton, Ontario, Canada
  Web: http://socserv.mcmaster.ca/jfox