[Rd] difference of m1 <- lm(f, data) and update(m1, formula=f)

Martin Maechler m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Wed Aug 11 15:15:15 CEST 2021


I'm diverting this from R-help to R-devel,

because I'm asking / musing if and if where we should / could
change R here (see below).

>>>>> Martin Maechler on 11 Aug 2021 11:51:25 +0200

>>>>> Tim Taylor .. on 08:45:48 +0000 writes:

    >> Manipulating formulas within different models I notice the following:

    >> m1 <- lm(formula = hp ~ cyl, data = mtcars)
    >> m2 <- update(m1, formula. = hp ~ cyl)
    >> all.equal(m1, m2)
    >> #> [1] TRUE
    >> identical(m1, m2)
    >> #> [1] FALSE
    >> waldo::compare(m1, m2)
    >> #> `old$call[[2]]` is a call
    >> #> `new$call[[2]]` is an S3 object of class <formula>, a call

    >> I'm aware formulas are a form of call but what I'm unsure
    >> of is whether there is meaningful difference between the
    >> two versions of the models? 

    > A good question.
    > In principle, the promise of an update()  method should be to
    > produce the *same* result as calling the original model-creation
    > (or more generally object-creation) function call.

    > So, already with identical(), you've shown that this is not
    > quite the case for simple lm(),
    > and indeed that is a bit undesirable.

    > To answer your question re "meaningful" difference,
    > given what I say above is:
    > No, there shouldn't be any relevant difference, and if there is,
    > that may considered a bug in the respective update() method,
    > here update.lm.

    > More about this in the following  R code snippet :

Again, a repr.ex.:

---0<-------0<-------0<-------0<-------0<-------0<-------0<----

m1 <- lm(formula = hp ~ cyl, data = mtcars)
m2  <- update(m1, formula. = hp ~ cyl)
m2a <- update(m1)
identical(m1, m2a)#>  TRUE !
## ==> calling update() & explicitly specifying the formula is "the problem"

identical(m1$call, m2$call) #> [1] FALSE
noCall <- function(x) x[setdiff(names(x), "call")]
identical(noCall(m1), noCall(m2))# TRUE!
## look closer:
c1 <- m1$call
c2 <- m2$call
str(as.list(c1))
## List of 3
##  $        : symbol lm
##  $ formula: language hp ~ cyl
##  $ data   : symbol mtcars

str(as.list(c2))
## List of 3
##  $        : symbol lm
##  $ formula:Class 'formula'  language hp ~ cyl
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##  $ data   : symbol mtcars

identical(c1[-2], c2[-2]) # TRUE ==> so, indeed the difference is *only* in the formula ( = [2]) component
f1 <- c1$formula
f2 <- c2$formula
all.equal(f1,f2) # TRUE
identical(f1,f2) # FALSE

## Note that this is typically *not* visible if the user uses
## the accessor functions they should :
identical(formula(m1), formula(m2)) # TRUE !
## and indeed, the formula() method for 'lm'  does set the environment:
stats:::formula.lm

---0<-------0<-------0<-------0<-------0<-------0<-------0<----

We know that it has been important in  R  the formulas have an
environment and that's been the only R-core recommended way to
do non-standard evaluation (!! .. but let's skip that for now !!).

OTOH we have also kept the convention that a formula without
environment implicitly means its environment
is .GlobalEnv aka globalenv().

Currently, I think formula() methods then *should* always return
a formula *with* an environment .. even though that's not
claimed in the reference, i.e., ?formula.

Also, the print() method for formulas by default does *not* show the
environment if it is .GlobalEnv, as you can see on that help
already in the "Usage" section:

     ## S3 method for class 'formula'
     print(x, showEnv = !identical(e, .GlobalEnv), ...)
     
Now, I've looked at the update() here, which is update.default()
and the source code of that currently is

update.formula <- function (old, new, ...)
{
    tmp <- .Call(C_updateform, as.formula(old), as.formula(new))
    ## FIXME?: terms.formula() with "large" unneeded attributes:
    formula(terms.formula(tmp, simplify = TRUE))
}

where the important part is the "FIXME" comment (seen in the R
sources, but no longer in the R function after installation).

My current "idea" is to formalize what we see working here:
namely allow  update.formula() to *not* set the environment of
its result *if* that environment would be .GlobalEnv ..

--> I'm starting to test my proposal
but would still be *very* glad for comments, also contradicting
ones!

Martin



More information about the R-devel mailing list