[Rd] Apparent bug in behavior of formulas with '-' operator for lm

Fri Mar 16 14:21:17 CET 2018

Thanks, Joris,

This clarifies at least where exactly it comes from. I still find the
high-level behavior of 'predict' very counter-intuitive as the estimated
model contains no coefficients in 'z', but I think we agree on that.

I am not sure how much trouble it would be to improve this behavior, but
perhaps one of the core authors can have a look at it.

Best,
Mark

Op vr 16 mrt. 2018 om 13:22 schreef Joris Meys <jorismeys at gmail.com>:

> Technically it is used as a predictor in the model. The information is
> contained in terms :
>
> > terms(x ~ . - z, data = d)
> x ~ (y + z) - z
> attr(,"variables")
> list(x, y, z)
> attr(,"factors")
>   y
> x 0
> y 1
> z 0
> attr(,"term.labels")
> [1] "y"
> attr(,"order")
> [1] 1
> attr(,"intercept")
> [1] 1
> attr(,"response")
> [1] 1
> attr(,".Environment")
>
> And the model.frame contains it :
>
> > head(model.frame(x ~ . - z, data = d))
>             x          y z
> 2 -0.06022984 -0.4483109 b
> 3  1.25293390  0.2687065 c
> 4 -1.11811090  0.8016076 d
> 5 -0.75521720 -0.7484931 e
> 6  0.93037156  0.4128456 f
> 7  1.32052028 -1.6609043 a
>
> It is at the construction of the model.matrix that z disappears, but the
> contrasts information for z is still attached :
>
> > attr(model.matrix(x ~ . - z, data = d),"contrasts")
> $z
> [1] "contr.treatment"
>
> As you can see from the error you printed, it is model.frame() complaining
> about it. In this case it wouldn't be necessary, but it is documented
> behaviour of model.frame. Which is why I didn't say "this is not a bug",
> but "this is not a bug per se". Meaning that this is not optimal behaviour
> and might not what you expect, but it follows the documentation of the
> underlying functions.
>
> Solving it would require a bypass of model.frame() to construct the
> correct model,matrix for the new predictions, and that's far from trivial
> as model.matrix() itself depends on model.frame().
>
> Cheers
> Joris
>
>
>
>
>
> On Fri, Mar 16, 2018 at 1:09 PM, Mark van der Loo <
> mark.vanderloo at gmail.com> wrote:
>
>> Joris, the point is that 'z' is NOT used as a predictor in the model.
>> Therefore it should not affect predictions. Also, I find it suspicious that
>> the error only occurs when the response variable conitains missings and 'z'
>> is unique (I have tested several other cases to confirm this).
>>
>> -Mark
>>
>> Op vr 16 mrt. 2018 om 13:03 schreef Joris Meys <jorismeys at gmail.com>:
>>
>>> It's not a bug per se. It's the effect of removing all observations
>>> linked to a certain level in your data frame. So the output of lm() doesn't
>>> contain a coefficient for level a of z, but your new data contains that
>>> level a. With a small addition, this works again:
>>>
>>> d <- data.frame(x=rnorm(12),y=rnorm(12),z=rep(letters[1:6],2))
>>>
>>> d$x[1] <- NA
>>> m <- lm(x ~ . -z, data=d)
>>> p <- predict(m, newdata=d)
>>>
>>> This is linked to another discussion earlier on stackoverflow :
>>> https://stackoverflow.com/questions/48461980/prediction-in-r-glmm
>>> which lead to an update to lme4 :
>>> https://github.com/lme4/lme4/issues/452
>>>
>>> The point being that factors in your newdata should have the same levels
>>> as factors in the original data that was used to fit the model. If you add
>>> levels to these factors, it's impossible to use that model to predict for
>>> these new data.
>>>
>>> Cheers
>>> Joris
>>>
>>> On Fri, Mar 16, 2018 at 10:21 AM, Mark van der Loo <
>>> mark.vanderloo at gmail.com> wrote:
>>>
>>>> Dear R-developers,
>>>>
>>>> In the 'lm' documentation, the '-' operator is only specified to be used
>>>> with -1 (to remove the intercept from the model).
>>>>
>>>> However, the documentation also refers to the 'formula' help file, which
>>>> indicates that it is possible to subtract any term. Indeed, the
>>>> following
>>>> works with no problems (the period '.' stands for 'all terms except the
>>>> lhs'):
>>>>
>>>> d <- data.frame(x=rnorm(6), y=rnorm(6), z=letters[1:2])
>>>> m <- lm(x ~ . -z, data=d)
>>>> p <- predict(m,newdata=d)
>>>>
>>>> Now, if I change 'z' so that it has only unique values, and I introduce
>>>> an
>>>> NA in the predicted variable, the following happens:
>>>>
>>>> d <- data.frame(x=rnorm(6),y=rnorm(6),z=letters[1:6])
>>>> d$x[1] <- NA
>>>> m <- lm(x ~ . -z, data=d)
>>>> p <- predict(m, newdata=d)
>>>> Error in model.frame.default(Terms, newdata, na.action = na.action,
>>>> xlev =
>>>> object$xlevels) : factor z has new levels a
>>>>
>>>> It seems a bug to me, although one could argue that 'lm's documentation
>>>> does not allow one to expect that the '-' operator should work
>>>> generally.
>>>>
>>>> If it is a bug I'm happy to report it to bugzilla.
>>>>
>>>> Thanks for all your efforts,
>>>> Mark
>>>>
>>>> ps: I was not able to test this on R3.4.4 yet, but the NEWS does not
>>>> mention fixes related to 'lm' or 'predict'.
>>>>
>>>>
>>>> > sessionInfo()
>>>> R version 3.4.3 (2017-11-30)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>> Running under: Ubuntu 16.04.4 LTS
>>>>
>>>> Matrix products: default
>>>> BLAS: /usr/lib/libblas/libblas.so.3.6.0
>>>> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>  LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>  [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>> LC_PAPER=nl_NL.UTF-8       LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>  LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] compiler_3.4.3 tools_3.4.3    yaml_2.1.16
>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>>
>>>
>>> --
>>> Joris Meys
>>> Statistical consultant
>>>
>>> Department of Data Analysis and Mathematical Modelling
>>> Ghent University
>>> Coupure Links 653, B-9000 Gent (Belgium)
>>>
>>> <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g>
>>>
>>> -----------
>>> Biowiskundedagen 2017-2018
>>> http://www.biowiskundedagen.ugent.be/
>>>
>>> -------------------------------
>>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>>>
>>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Department of Data Analysis and Mathematical Modelling
> Ghent University
> Coupure Links 653, B-9000 Gent (Belgium)
>
> <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g>
>
> -----------
> Biowiskundedagen 2017-2018
> http://www.biowiskundedagen.ugent.be/
>
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>

	[[alternative HTML version deleted]]