[Rd] lm() takes weights from formula environment

Sun Aug 9 21:07:40 CEST 2020

On 09/08/2020 3:01 p.m., John Mount wrote:
> Doesn't this preclude "y ~ ." style notations?

Yes, but you can use "y ~ . - w".
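
A minimal sketch of that suggestion, assuming the weights have been added to d as a column named w (the names fit_dot and fit_ref are only for illustration). The "." expands to every column of d other than the response, so w has to be subtracted again to keep it out of the predictors:

```r
d <- data.frame(x = 1:3, y = c(3, 3, 4))
d$w <- c(1, 5, 1)   # weights stored as a column of d

# "." expands to x + w; "- w" removes the weights column from the predictors
fit_dot <- lm(y ~ . - w, data = d, weights = w)

# explicit formula for comparison
fit_ref <- lm(y ~ x, data = d, weights = w)

all.equal(coef(fit_dot), coef(fit_ref))
```

Both calls look up w in d first, so the two fits should agree.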

Duncan Murdoch


> 
>> On Aug 9, 2020, at 11:56 AM, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>
>> This is fairly clearly documented in ?lm:
>>
>> "All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula."
>>
>> There are lots of possible places to look for weights, but this seems to me like a pretty sensible search order.  In most cases the environment of the formula will have a parent environment chain that eventually leads to the global environment, so (with no conflicts) your strategy of defining w there will sometimes work, but looks pretty unreliable.
>>
>> When you say you want to work around this search order, I think the obvious way is to add your w vector to your d dataframe.  That way it is guaranteed to be found even if there's a conflicting variable in the formula environment, or the global environment.
>>
>> Duncan Murdoch
>>
>> On 09/08/2020 2:13 p.m., John Mount wrote:
>>> I know programmers can reason this out from R's late parameter evaluation rules PLUS the explicit match.call()/eval() that lm() does to work with the passed-in formula and data frame. But from a statistical user's point of view this seems counter-productive. At best it works as if the user is passing in the name of the weights variable instead of its values (I know this is the obvious consequence of NSE).
>>> lm() takes instance weights from the formula environment. Usually that environment is the interactive environment or a close child of it, and we are lucky enough to have no intervening name collisions, so we don't have a problem. However, it makes programming over formulas for lm() a bit tricky. Here is an example of the issue.
>>> Is there any recommended discussion on this and how to work around it? In my own work I explicitly set the formula environment and put the weights in that environment.
>>> d <- data.frame(x = 1:3, y = c(3, 3, 4))
>>> w <- c(1, 5, 1)
>>> # works
>>> lm(y ~ x, data = d, weights = w)
>>> # fails, as weights are taken from formula environment
>>> fn <- function() {  # deliberately set up formula with bad value in environment
>>>    w <- c(-1, -1, -1, -1)  # bad weights
>>>    f <- as.formula(y ~ x)  # captures bad weights with as.formula(env = parent.frame()) default
>>>    return(f)
>>> }
>>> lm(fn(), data = d, weights = w)
>>> # Error in model.frame.default(formula = fn(), data = d, weights = w, drop.unused.levels = TRUE) :
>>> #   variable lengths differ (found for '(weights)')
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
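
The two workarounds mentioned in this thread can be sketched against the original example as follows. This is an illustration only, not code from either poster: the names d2, env, fit_data, fit_env and fit_ref are made up here, and fn() is simplified to return the formula directly.

```r
d <- data.frame(x = 1:3, y = c(3, 3, 4))
w <- c(1, 5, 1)

fn <- function() {
  w <- c(-1, -1, -1, -1)   # bad weights living in the formula's environment
  y ~ x                    # the formula captures this function's environment
}

# reference fit: the formula is created where the good w is visible
fit_ref <- lm(y ~ x, data = d, weights = w)

# Workaround 1: put the weights in the data frame; data is searched
# before the formula environment, so the bad w is never reached
d2 <- d
d2$w <- w
fit_data <- lm(fn(), data = d2, weights = w)

# Workaround 2: give the formula a fresh environment holding the good weights
f <- fn()
env <- new.env(parent = globalenv())
assign("w", w, envir = env)
environment(f) <- env
fit_env <- lm(f, data = d, weights = w)

all.equal(coef(fit_data), coef(fit_ref))
all.equal(coef(fit_env), coef(fit_ref))
```

The first approach relies only on the search order quoted from ?lm; the second replaces the environment that fn() attached to the formula, so anything else the formula needs from that environment would have to be copied over as well.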