[Rd] Apparent bug in behavior of formulas with '-' operator for lm

Mark van der Loo mark.vanderloo at gmail.com
Fri Mar 16 10:21:23 CET 2018

Dear R-developers,

In the 'lm' documentation, the '-' operator is only specified to be used
with -1 (to remove the intercept from the model).

However, the documentation also refers to the 'formula' help file, which
indicates that it is possible to subtract any term. Indeed, the following
works with no problems (the period '.' stands for 'all terms except the
response'):

d <- data.frame(x=rnorm(6), y=rnorm(6), z=letters[1:2])
m <- lm(x ~ . -z, data=d)
p <- predict(m,newdata=d)
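
For reference, here is a small sketch of how the formula expands, using only
base R's terms(). The fixed seed and the names 'tt' and 'labs' are my own
additions for illustration and are not part of the example above.

```r
set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
d <- data.frame(x = rnorm(6), y = rnorm(6), z = letters[1:2])

# Expand '.' against the columns of d, then subtract z.
tt <- terms(x ~ . - z, data = d)

# Terms that remain on the right-hand side after the subtraction.
labs <- attr(tt, "term.labels")
print(labs)

# Variables recorded in the terms object; worth checking whether z is
# still listed here even though it is no longer a model term.
print(attr(tt, "variables"))
```

Only 'y' should remain as a model term; whether 'z' is still listed among the
variables is something to check on your own R version.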

Now, if I change 'z' so that all of its values are unique, and I introduce an
NA in the response variable 'x', the following happens:

d <- data.frame(x=rnorm(6),y=rnorm(6),z=letters[1:6])
d$x[1] <- NA
m <- lm(x ~ . -z, data=d)
p <- predict(m, newdata=d)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
object$xlevels) : factor z has new levels a
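
To narrow this down, a minimal sketch that inspects the factor levels stored
with the fit and then tries a possible workaround: dropping the column from
the data before fitting instead of subtracting it in the formula. The names
'd_noz', 'm2' and 'p2' and the fixed seed are hypothetical additions of mine.
I make no claim about what m$xlevels contains on any particular R version.

```r
set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
d <- data.frame(x = rnorm(6), y = rnorm(6), z = letters[1:6])
d$x[1] <- NA

# Fit as in the report and look at the levels recorded for prediction.
# If z appears here without level "a", that would be consistent with the
# error message above.
m <- lm(x ~ . - z, data = d)
print(m$xlevels)

# Possible workaround: remove z from the data rather than from the formula.
d_noz <- d[setdiff(names(d), "z")]
m2 <- lm(x ~ ., data = d_noz)

# newdata may still carry the extra column z; it is not part of the model.
p2 <- predict(m2, newdata = d)
print(p2)
```

Since 'z' never enters the second model, its levels cannot clash with those
in 'newdata'.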

This looks like a bug to me, although one could argue that the documentation
of 'lm' gives no reason to expect the '-' operator to work in general.

If it is a bug, I'm happy to report it to Bugzilla.

Thanks for all your efforts,

PS: I have not yet been able to test this on R 3.4.4, but the NEWS does not
mention fixes related to 'lm' or 'predict'.

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3    yaml_2.1.16
