[Rd] [External] difference of m1 <- lm(f, data) and update(m1, formula=f)

iuke-tier@ey m@iii@g oii uiow@@edu iuke-tier@ey m@iii@g oii uiow@@edu
Wed Aug 11 17:24:08 CEST 2021


On Wed, 11 Aug 2021, Martin Maechler wrote:

> I'm diverting this from R-help to R-devel,
>
> because I'm asking / musing if and if where we should / could
> change R here (see below).
>
>>>>>> Martin Maechler on 11 Aug 2021 11:51:25 +0200
>
>>>>>> Tim Taylor .. on 08:45:48 +0000 writes:
>
>    >> Manipulating formulas within different models I notice the following:
>
>    >> m1 <- lm(formula = hp ~ cyl, data = mtcars)
>    >> m2 <- update(m1, formula. = hp ~ cyl)
>    >> all.equal(m1, m2)
>    >> #> [1] TRUE
>    >> identical(m1, m2)
>    >> #> [1] FALSE
>    >> waldo::compare(m1, m2)
>    >> #> `old$call[[2]]` is a call
>    >> #> `new$call[[2]]` is an S3 object of class <formula>, a call
>
>    >> I'm aware formulas are a form of call but what I'm unsure
>    >> of is whether there is meaningful difference between the
>    >> two versions of the models?
>
>    > A good question.
>    > In principle, the promise of an update()  method should be to
>    > produce the *same* result as calling the original model-creation
>    > (or more generally object-creation) function call.
>
>    > So, already with identical(), you've shown that this is not
>    > quite the case for simple lm(),
>    > and indeed that is a bit undesirable.
>
>    > To answer your question re "meaningful" difference,
>    > given what I say above is:
>    > No, there shouldn't be any relevant difference, and if there is,
>    > that may considered a bug in the respective update() method,
>    > here update.lm.
>
>    > More about this in the following  R code snippet :
>
> Again, a repr.ex.:
>
> ---0<-------0<-------0<-------0<-------0<-------0<-------0<----
>
> m1 <- lm(formula = hp ~ cyl, data = mtcars)
> m2  <- update(m1, formula. = hp ~ cyl)
> m2a <- update(m1)
> identical(m1, m2a)#>  TRUE !
> ## ==> calling update() & explicitly specifying the formula is "the problem"
>
> identical(m1$call, m2$call) #> [1] FALSE
> noCall <- function(x) x[setdiff(names(x), "call")]
> identical(noCall(m1), noCall(m2))# TRUE!
> ## look closer:
> c1 <- m1$call
> c2 <- m2$call
> str(as.list(c1))
> ## List of 3
> ##  $        : symbol lm
> ##  $ formula: language hp ~ cyl
> ##  $ data   : symbol mtcars
>
> str(as.list(c2))
> ## List of 3
> ##  $        : symbol lm
> ##  $ formula:Class 'formula'  language hp ~ cyl
> ##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
> ##  $ data   : symbol mtcars
>
> identical(c1[-2], c2[-2]) # TRUE ==> so, indeed the difference is *only* in the formula ( = [2]) component
> f1 <- c1$formula
> f2 <- c2$formula
> all.equal(f1,f2) # TRUE
> identical(f1,f2) # FALSE
>
> ## Note that this is typically *not* visible if the user uses
> ## the accessor functions they should :
> identical(formula(m1), formula(m2)) # TRUE !
> ## and indeed, the formula() method for 'lm'  does set the environment:
> stats:::formula.lm
>
> ---0<-------0<-------0<-------0<-------0<-------0<-------0<----
>
> We know that it has been important in  R  the formulas have an
> environment and that's been the only R-core recommended way to
> do non-standard evaluation (!! .. but let's skip that for now !!).
>
> OTOH we have also kept the convention that a formula without
> environment implicitly means its environment
> is .GlobalEnv aka globalenv().
>
> Currently, I think formula() methods then *should* always return
> a formula *with* an environment .. even though that's not
> claimed in the reference, i.e., ?formula.
>
> Also, the print() method for formulas by default does *not* show the
> environment if it is .GlobalEnv, as you can see on that help
> already in the "Usage" section:
>
>     ## S3 method for class 'formula'
>     print(x, showEnv = !identical(e, .GlobalEnv), ...)
>
> Now, I've looked at the update() here, which is update.default()
> and the source code of that currently is
>
> update.formula <- function (old, new, ...)
> {
>    tmp <- .Call(C_updateform, as.formula(old), as.formula(new))
>    ## FIXME?: terms.formula() with "large" unneeded attributes:
>    formula(terms.formula(tmp, simplify = TRUE))
> }
>
> where the important part is the "FIXME" comment (seen in the R
> sources, but no longer in the R function after installation).
>
> My current "idea" is to formalize what we see working here:
> namely allow  update.formula() to *not* set the environment of
> its result *if* that environment would be .GlobalEnv ..
>
> --> I'm starting to test my proposal
> but would still be *very* glad for comments, also contradicting
> ones!

m1$call is the parsed expression for the call to lm(), so
m1$call$formula is the expression that was evaluated to produce the formula.
Typically this will be a call expression with `~` as the function.
It could also be a symbol:

> frm <- hp ~ cyl
> m3 <- lm(formula = frm, data = mtcars)
> m3$call$formula
frm

update.default is creating a new call object by putting in a new
expression for the formula argument. It so happens that putting in a
formula object actually works: The only difference between the AST for
a call of `~` and the formula such a call produces when evaluated is
the class and environment attributes the call adds, and most code that
works with expressions, like eval(), ignores attributes.

It would seem somewhat more consistent if update.default put the
expression that would produce the formula into the call (i.e. stripped
out the two attributes).

But I do not know if there is logic in base R code, never mind package
code, that takes advantage of the attributes on the formula expression
in if they are found. formula() looks in the 'terms' component so would
not be affects, but I don't know if something else might be.

Best,

luke


>
> Martin
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu



More information about the R-devel mailing list