[Rd] inconsistent handling of factor, character, and logical predictors in lm()

William Dunlap wdunlap at tibco.com
Sat Aug 31 19:21:01 CEST 2019


> Functions like lm() treat logical predictors as factors, *not* as
> numerical variables.

Not quite.  A factor whose elements are all the same causes lm() to give
an error, while a logical predictor that is all TRUE or all FALSE is
effectively omitted from the model (it gets a coefficient of NA).
Single-valued predictors are fairly common when you fit models to subsets
of a big data.frame.  This is an argument for fixing the
single-valued-factor problem, which would become more noticeable if
logicals were treated as factors.

> d <- data.frame(Age=c(2,4,6,8,10), Weight=c(878, 890, 930, 800, 750),
+                 Diseased=c(FALSE,FALSE,FALSE,TRUE,TRUE))
> coef(lm(data=d, Weight ~ Age + Diseased))
 (Intercept)          Age DiseasedTRUE
    877.7333       5.4000    -151.3333
> coef(lm(data=d, Weight ~ Age + factor(Diseased)))
         (Intercept)                  Age factor(Diseased)TRUE
            877.7333               5.4000            -151.3333
> coef(lm(data=d, Weight ~ Age + Diseased, subset=Age<7))
 (Intercept)          Age DiseasedTRUE
    847.3333      13.0000           NA
> coef(lm(data=d, Weight ~ Age + factor(Diseased), subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
> coef(lm(data=d, Weight ~ Age + factor(Diseased, levels=c(FALSE,TRUE)),
+         subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
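
A minimal sketch of why the two cases differ, reusing the data frame d from above. It relies on two things not spelled out in the session itself, so treat them as my reading: lm() builds its model frame with unused factor levels dropped, which leaves the factor with a single level, while a logical keeps its DiseasedTRUE column even when no TRUE is present. The names fit_lgl, zero_col, f_sub, n_lev and res are only illustrative.

```r
d <- data.frame(Age = c(2, 4, 6, 8, 10),
                Weight = c(878, 890, 930, 800, 750),
                Diseased = c(FALSE, FALSE, FALSE, TRUE, TRUE))

# Logical predictor: the DiseasedTRUE column is kept but is all zeros
# in the subset, so its coefficient cannot be estimated (NA).
fit_lgl <- lm(Weight ~ Age + Diseased, data = d, subset = Age < 7)
zero_col <- all(model.matrix(fit_lgl)[, "DiseasedTRUE"] == 0)

# Factor predictor: after subsetting, only one level is still in use.
f_sub <- droplevels(factor(d$Diseased, levels = c(FALSE, TRUE))[d$Age < 7])
n_lev <- nlevels(f_sub)

# The resulting error can be trapped, e.g. when looping over many subsets.
res <- tryCatch(
  lm(Weight ~ Age + factor(Diseased), data = d, subset = Age < 7),
  error = function(e) e
)
```

Trapping the error with tryCatch() is only a workaround for code that fits many subsets; it does not address the inconsistency itself.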

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Sat, Aug 31, 2019 at 8:54 AM Fox, John <jfox using mcmaster.ca> wrote:

> Dear Abby,
>
> > On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdle.a using gmail.com> wrote:
> >
> >> I think that it would be better to handle factors, character
> >> predictors, and logical predictors consistently.
> >
> > "logical predictors" can be regarded as categorical or continuous (i.e.
> > 0 or 1).
> > And the model matrix should be the same, either way.
>
> I think that you're mistaking a coincidence for a principle. The
> coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE.
> Functions like lm() treat logical predictors as factors, *not* as numerical
> variables.
>
> That one would get the same coefficient in either case is a consequence of
> the coincidence and the fact that the default contrasts for unordered
> factors are contr.treatment(). For example, if you changed the contrasts
> option, you'd get a different estimate (though of course a model with the
> same fit to the data and an equivalent interpretation):
>
> ------------ snip --------------
>
> > options(contrasts=c("contr.sum", "contr.poly"))
> > m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> > m3
>
> Call:
> lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
>     data = iris)
>
> Coefficients:
>             (Intercept)              Sepal.Width  I(Species == "setosa")1
>                  2.6672                   0.9418                   0.8898
>
> > head(model.matrix(m3))
>   (Intercept) Sepal.Width I(Species == "setosa")1
> 1           1         3.5                      -1
> 2           1         3.0                      -1
> 3           1         3.2                      -1
> 4           1         3.1                      -1
> 5           1         3.6                      -1
> 6           1         3.9                      -1
> > tail(model.matrix(m3))
>     (Intercept) Sepal.Width I(Species == "setosa")1
> 145           1         3.3                       1
> 146           1         3.0                       1
> 147           1         2.5                       1
> 148           1         3.0                       1
> 149           1         3.4                       1
> 150           1         3.0                       1
>
> > lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"),
> +    data=iris)
>
> Call:
> lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species ==
>     "setosa"), data = iris)
>
> Coefficients:
>                     (Intercept)                      Sepal.Width  as.numeric(Species == "setosa")
>                          3.5571                           0.9418                          -1.7797
>
> > -2*coef(m3)[3]
> I(Species == "setosa")1
>               -1.779657
>
> ------------ snip --------------
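
A minimal sketch, not part of the quoted session, of how the two sets of estimates above relate. Under contr.sum the logical's column is 1 for FALSE and -1 for TRUE, i.e. 1 - 2*x where x is the 0/1 coding, so the 0/1 slope should be -2 times the sum-contrast coefficient and the 0/1 intercept should be the sum-contrast intercept plus that coefficient. The object names (m_logical, m_numeric, b, g) are illustrative, and the contrasts option is restored afterwards.

```r
# Fit with sum-to-zero contrasts, then restore the previous option.
old <- options(contrasts = c("contr.sum", "contr.poly"))
m_logical <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
                data = iris)
options(old)

# Same predictor coded explicitly as 0/1; contrasts do not apply here.
m_numeric <- lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"),
                data = iris)

b <- unname(coef(m_logical))  # intercept, Sepal.Width, sum-contrast term
g <- unname(coef(m_numeric))  # intercept, Sepal.Width, 0/1 slope

same_slope     <- isTRUE(all.equal(g[3], -2 * b[3]))
same_intercept <- isTRUE(all.equal(g[1], b[1] + b[3]))
same_fit       <- isTRUE(all.equal(unname(fitted(m_logical)),
                                   unname(fitted(m_numeric))))
```

All three checks compare reparameterizations of the same model, which is the sense in which the fit is unchanged while the coefficient differs.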
>
>
> >
> > I think the first question to be asked is, which is the best approach,
> > categorical or continuous?
> > The continuous approach seems simpler and more efficient to me, but
> > output from the categorical approach may be more intuitive, for some
> > people.
>
> I think that this misses the point I was trying to make: lm() et al. treat
> logical variables as factors, not as numerical predictors. One could argue
> about what's the better approach but not about what lm() does. BTW, I
> prefer treating a logical predictor as a factor because the predictor is
> essentially categorical.
>
> >
> > I note that the use of factors and characters doesn't necessarily
> > produce consistent output, for $xlevels.
> > (Because factors can have their levels re-ordered).
>
> Again, this misses the point: Both factors and character predictors
> produce elements in $xlevels; logical predictors do not, even though they
> are treated in the model as factors. That factors have levels that aren't
> necessarily ordered alphabetically is a reason that I prefer using factors
> to using character predictors, but this has nothing to do with the point I
> was trying to make about $xlevels.
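
A small sketch for checking the $xlevels point on your own installation. The data frame d2 and its columns (f, ch, lg) are made up for illustration. The message above reports that, as of this thread, the logical predictor did not appear in $xlevels; whether that still holds may depend on the R version, so the code only prints the result and does not assume it.

```r
d2 <- data.frame(
  y  = c(1.2, 2.5, 2.9, 4.1, 5.3, 5.8, 7.4, 8.0),
  f  = factor(c("a", "b", "a", "b", "a", "b", "a", "b")),
  ch = c("u", "u", "v", "v", "u", "u", "v", "v"),
  lg = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

m <- lm(y ~ f + ch + lg, data = d2)

# Factor and character predictors record their levels here;
# inspect whether the logical predictor 'lg' is listed as well.
print(names(m$xlevels))
print(m$xlevels)
```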
>
> Best,
>  John
>
>   -------------------------------------------------
>   John Fox, Professor Emeritus
>   McMaster University
>   Hamilton, Ontario, Canada
>   Web: http://socserv.mcmaster.ca/jfox
