[Rd] inconsistent handling of factor, character, and logical predictors in lm()
William Dunlap
wdunlap at tibco.com
Sat Aug 31 19:21:01 CEST 2019
> Functions like lm() treat logical predictors as factors, *not* as
> numerical variables.
Not quite. A factor with all elements the same causes lm() to give an
error, while a logical of all TRUEs or all FALSEs is simply dropped from the
model (it gets a coefficient of NA). This is a fairly common situation
when you fit models to subsets of a big data.frame. This is an argument
for fixing the single-valued-factor problem, which would become more
noticeable if logicals were treated as factors.
> d <- data.frame(Age=c(2,4,6,8,10), Weight=c(878, 890, 930, 800, 750),
    Diseased=c(FALSE,FALSE,FALSE,TRUE,TRUE))
> coef(lm(data=d, Weight ~ Age + Diseased))
 (Intercept)          Age DiseasedTRUE
    877.7333       5.4000    -151.3333
> coef(lm(data=d, Weight ~ Age + factor(Diseased)))
         (Intercept)                  Age factor(Diseased)TRUE
            877.7333               5.4000            -151.3333
> coef(lm(data=d, Weight ~ Age + Diseased, subset=Age<7))
 (Intercept)          Age DiseasedTRUE
    847.3333      13.0000           NA
> coef(lm(data=d, Weight ~ Age + factor(Diseased), subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
> coef(lm(data=d, Weight ~ Age + factor(Diseased, levels=c(FALSE,TRUE)),
    subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
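
Until the single-valued-factor case is handled, one possible workaround when
fitting to subsets is to check which predictors are constant in the rows
actually used and leave them out of the formula. The following is only a
sketch of that idea, not something lm() does itself; the helper name
constant_predictors() and the extra column DiseasedF are made up for
illustration.

```r
# Hypothetical helper (not part of R): return the names of the columns in
# 'data' that take fewer than two distinct non-missing values.
constant_predictors <- function(data) {
  is_const <- vapply(data,
                     function(x) length(unique(x[!is.na(x)])) < 2L,
                     logical(1))
  names(data)[is_const]
}

d <- data.frame(Age = c(2, 4, 6, 8, 10),
                Weight = c(878, 890, 930, 800, 750),
                Diseased = c(FALSE, FALSE, FALSE, TRUE, TRUE))
d$DiseasedF <- factor(d$Diseased)  # illustrative factor copy of Diseased

# Restrict to the same rows as subset=Age<7 in the session above.
sub <- subset(d, Age < 7)
dropped <- constant_predictors(sub[c("Age", "Diseased", "DiseasedF")])

# Refit using only the predictors that still vary within the subset.
keep <- setdiff(c("Age", "DiseasedF"), dropped)
fit <- lm(reformulate(keep, response = "Weight"), data = sub)
coef(fit)
```

The design choice here is to drop the constant predictor before the model
frame is built, which avoids the contrasts error at the cost of not
reporting an NA coefficient for it.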
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Sat, Aug 31, 2019 at 8:54 AM Fox, John <jfox using mcmaster.ca> wrote:
> Dear Abby,
>
> > On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdle.a using gmail.com> wrote:
> >
> >> I think that it would be better to handle factors, character
> >> predictors, and logical predictors consistently.
> >
> > "logical predictors" can be regarded as categorical or continuous (i.e.
> > 0 or 1).
> > And the model matrix should be the same, either way.
>
> I think that you're mistaking a coincidence for a principle. The
> coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE.
> Functions like lm() treat logical predictors as factors, *not* as numerical
> variables.
>
> That one would get the same coefficient in either case is a consequence of
> the coincidence and the fact that the default contrasts for unordered
> factors are contr.treatment(). For example, if you changed the contrasts
> option, you'd get a different estimate (though of course a model with the
> same fit to the data and an equivalent interpretation):
>
> ------------ snip --------------
>
> > options(contrasts=c("contr.sum", "contr.poly"))
> > m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> > m3
>
> Call:
> lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
>     data = iris)
>
> Coefficients:
>             (Intercept)              Sepal.Width  I(Species == "setosa")1
>                  2.6672                   0.9418                   0.8898
>
> > head(model.matrix(m3))
>   (Intercept) Sepal.Width I(Species == "setosa")1
> 1           1         3.5                      -1
> 2           1         3.0                      -1
> 3           1         3.2                      -1
> 4           1         3.1                      -1
> 5           1         3.6                      -1
> 6           1         3.9                      -1
> > tail(model.matrix(m3))
>     (Intercept) Sepal.Width I(Species == "setosa")1
> 145           1         3.3                       1
> 146           1         3.0                       1
> 147           1         2.5                       1
> 148           1         3.0                       1
> 149           1         3.4                       1
> 150           1         3.0                       1
>
> > lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"),
>     data=iris)
>
> Call:
> lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species ==
>     "setosa"), data = iris)
>
> Coefficients:
>                     (Intercept)                      Sepal.Width
>                          3.5571                           0.9418
> as.numeric(Species == "setosa")
>                         -1.7797
>
> > -2*coef(m3)[3]
> I(Species == "setosa")1
>               -1.779657
>
> ------------ snip --------------
>
>
> >
> > I think the first question to be asked is, which is the best approach,
> > categorical or continuous?
> > The continuous approach seems simpler and more efficient to me, but
> > output from the categorical approach may be more intuitive, for some
> > people.
>
> I think that this misses the point I was trying to make: lm() et al. treat
> logical variables as factors, not as numerical predictors. One could argue
> about what's the better approach but not about what lm() does. BTW, I
> prefer treating a logical predictor as a factor because the predictor is
> essentially categorical.
>
> >
> > I note that the use of factors and characters doesn't necessarily
> > produce consistent output for $xlevels.
> > (Because factors can have their levels re-ordered).
>
> Again, this misses the point: Both factors and character predictors
> produce elements in $xlevels; logical predictors do not, even though they
> are treated in the model as factors. That factors have levels that aren't
> necessarily ordered alphabetically is a reason that I prefer using factors
> to using character predictors, but this has nothing to do with the point I
> was trying to make about $xlevels.
>
> Best,
> John
>
> -------------------------------------------------
> John Fox, Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: http://socserv.mcmaster.ca/jfox
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
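
The $xlevels point in the quoted message can be checked with a small sketch
like the one below. The data frame and the column names g_lgl, g_chr, and
g_fct are made up for illustration; the comments describe the behaviour as
stated in the quoted message rather than a guarantee for every R version.

```r
# The same two-group predictor stored three ways (illustrative data).
d <- data.frame(y = c(1.0, 2.5, 2.0, 4.0, 3.5, 5.0),
                g_lgl = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE))
d$g_chr <- as.character(d$g_lgl)
d$g_fct <- factor(d$g_lgl)

xl_fct <- lm(y ~ g_fct, data = d)$xlevels  # factor predictor
xl_chr <- lm(y ~ g_chr, data = d)$xlevels  # character predictor
xl_lgl <- lm(y ~ g_lgl, data = d)$xlevels  # logical predictor

# Factor and character predictors each contribute an entry to $xlevels;
# the logical predictor is expected to contribute none.
names(xl_fct)
names(xl_chr)
length(xl_lgl)
```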