[Rd] Apparent bug in behavior of formulas with '-' operator for lm

Joris Meys jorismeys at gmail.com
Fri Mar 16 13:03:32 CET 2018


It's not a bug per se. It's the effect of removing all observations linked
to a certain level in your data frame. So the output of lm() doesn't
contain a coefficient for level a of z, but your new data contains that
level a. With a small addition, this works again:

d <- data.frame(x=rnorm(12),y=rnorm(12),z=rep(letters[1:6],2))
d$x[1] <- NA
m <- lm(x ~ . -z, data=d)
p <- predict(m, newdata=d)

This is linked to another discussion earlier on stackoverflow :
https://stackoverflow.com/questions/48461980/prediction-in-r-glmm
which lead to an update to lme4 : https://github.com/lme4/lme4/issues/452

The point being that factors in your newdata should have the same levels as
factors in the original data that was used to fit the model. If you add
levels to these factors, it's impossible to use that model to predict for
these new data.

Cheers
Joris

On Fri, Mar 16, 2018 at 10:21 AM, Mark van der Loo <mark.vanderloo at gmail.com
> wrote:

> Dear R-developers,
>
> In the 'lm' documentation, the '-' operator is only specified to be used
> with -1 (to remove the intercept from the model).
>
> However, the documentation also refers to the 'formula' help file, which
> indicates that it is possible to subtract any term. Indeed, the following
> works with no problems (the period '.' stands for 'all terms except the
> lhs'):
>
> d <- data.frame(x=rnorm(6), y=rnorm(6), z=letters[1:2])
> m <- lm(x ~ . -z, data=d)
> p <- predict(m,newdata=d)
>
> Now, if I change 'z' so that it has only unique values, and I introduce an
> NA in the predicted variable, the following happens:
>
> d <- data.frame(x=rnorm(6),y=rnorm(6),z=letters[1:6])
> d$x[1] <- NA
> m <- lm(x ~ . -z, data=d)
> p <- predict(m, newdata=d)
> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
> object$xlevels) : factor z has new levels a
>
> It seems a bug to me, although one could argue that 'lm's documentation
> does not allow one to expect that the '-' operator should work generally.
>
> If it is a bug I'm happy to report it to bugzilla.
>
> Thanks for all your efforts,
> Mark
>
> ps: I was not able to test this on R3.4.4 yet, but the NEWS does not
> mention fixes related to 'lm' or 'predict'.
>
>
> > sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.4 LTS
>
> Matrix products: default
> BLAS: /usr/lib/libblas/libblas.so.3.6.0
> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
> LC_PAPER=nl_NL.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>  LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.3 tools_3.4.3    yaml_2.1.16
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)
<https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g>

-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

	[[alternative HTML version deleted]]



More information about the R-devel mailing list