[Rd] rnorm is not truly random used in the lm function
Victor Tian
tianxu03 at gmail.com
Thu Aug 3 18:33:41 CEST 2017
I did it purely based on the intuition I built from elsewhere and maybe in
R as well.
To summarise, it's basically an evaluation-ordering issue.
It looks like model.matrix() takes precedence over rnorm(100) here, i.e.,
evaluation goes outside-in rather than inside-out in this specific case?
If the inner parts were evaluated first, as in most cases, the two
rnorm(100) expressions would no longer give the same values.
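A minimal sketch of what I mean by outside-in, assuming only base R (the
names f, lhs, rhs, x1 and x2 are mine, just for illustration): the formula
keeps both sides as unevaluated calls, and they only yield different
numbers once each is evaluated separately.

```r
# Creating a formula does not evaluate rnorm(); both sides are stored as calls.
f <- rnorm(100) ~ rnorm(100)

lhs <- f[[2]]   # left-hand side of the ~
rhs <- f[[3]]   # right-hand side of the ~

is.call(lhs)          # the side is a language object, not a numeric vector
identical(lhs, rhs)   # compares the two expressions, not any random draws

# Evaluating each side by hand (inside-out) gives two independent draws.
set.seed(1)           # only for reproducibility of this sketch
x1 <- eval(lhs)
x2 <- eval(rhs)
identical(x1, x2)
```

So the two sides are the same expression, but not the same numbers unless
something decides to evaluate that expression only once.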
I guess it's because they appear the same to model.matrix()? This raises
another question: how does model.matrix() judge whether two variables on
either side of the ~ sign are the same? By the input literal?
Please clarify.
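A small experiment that may bear on this question. It is only a sketch
using terms(), model.frame() and model.matrix() from the stats package; the
object names (tt, vars, resp, mf, mm) are mine, and my reading of the
output is an assumption, not something I have confirmed from the sources.

```r
f <- rnorm(9) ~ rnorm(9)

# terms() works on the unevaluated formula and records its distinct variables.
tt   <- terms(f)
vars <- attr(tt, "variables")   # a call to list() holding those variables
resp <- attr(tt, "response")    # position of the response among them
length(vars) - 1L               # number of distinct variables found

# model.frame() is where the variables are actually evaluated.
mf <- model.frame(f)
dim(mf)
names(mf)

# model.matrix() then drops the response from the right-hand side;
# the two warnings quoted below are suppressed here.
mm <- suppressWarnings(model.matrix(f))
colnames(mm)
```

If I read this right, the two sides are matched as expressions, so only
one rnorm(9) variable is recorded and it is evaluated once.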
Thanks,
Victor
On Thu, Aug 3, 2017 at 12:11 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> >>>>> Victor Tian <tianxu03 at gmail.com>
> >>>>> on Thu, 3 Aug 2017 09:49:57 -0400 writes:
>
> > To whom it may concern,
> > I happened to run the following R code just to check the layout of the
> > output, but found that the code doesn't work the way I thought it should
> > work.
>
> yes, your expectations were wrong.
>
> >> lm(rnorm(100) ~ rnorm(100))
>
> > Call:
> > lm(formula = rnorm(100) ~ rnorm(100))
>
> > Coefficients:
> > (Intercept)
> > -0.07966
>
> > Warning messages:
> > 1: In model.matrix.default(mt, mf, contrasts) :
> > the response appeared on the right-hand side and was dropped
> > 2: In model.matrix.default(mt, mf, contrasts) :
> > problem with term 1 in model.matrix: no columns are assigned
>
>
> > It appears that rnorm(100) produces the same array of numbers on both
> > sides of the ~ sign.
>
> Indeed. And all this has nothing to do with lm() but rather with
> how formulas in R have been treated probably "forever".
> [I assume not only in R, but rather since the time formulas
> were introduced into the S language (for "S version 3") a few
> years before R was born. But I can no longer verify or disprove
> this assumption.]
>
> Even more revealing may be this:
>
> > f <- rnorm(9) ~ rnorm(9)
> > str(f)
> Class 'formula' language rnorm(9) ~ rnorm(9)
> ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
> > (mm <- model.matrix(f))
> (Intercept)
> 1 1
> 2 1
> 3 1
> 4 1
> 5 1
> 6 1
> 7 1
> 8 1
> 9 1
> attr(,"assign")
> [1] 0
> Warning messages:
> 1: In model.matrix.default(f) :
> the response appeared on the right-hand side and was dropped
> 2: In model.matrix.default(f) :
> problem with term 1 in model.matrix: no columns are assigned
> >
> ---------
>
> BTW: One of the goals of formulas, notably in R since they got an
> environment attached, is a clean way to deal with non-standard
> evaluation (=: NSE).
> [ Some of us would claim it is the only clean way to deal with NSE in R,
> and all new functionality using NSE should use formulas,
> but recently tidyverse-scholars have claimed to be able to deal
> with it cleanly w/o the use of formulas, but via "tidy evaluation" ]
>
> Using random expressions in a formula is therefore typically not
> a good idea, because you don't really know when the terms in the
> formula will be evaluated.
> For lm() and all other good formula-based statistical modeling
> functions, the evaluation happens via model.matrix().
>
> As you've noticed from that warning, model.matrix() tries to
> help the user by checking terms and eliminating those that
> appear on both sides of the '~'.
> This has been documented on the help page [ ?model.matrix ] for
> (almost exactly 14) years, the "Details:" section ending with
>
> _> By convention, if the response variable also appears on the
> _> right-hand side of the formula it is dropped (with a warning),
> _> although interactions involving the term are retained.
>
>
> I hope this explains the issue.
> And yes: Do *not* use rnorm() in formulas.
>
> Martin
>
> --
> Martin Mächler
> Seminar für Statistik, ETH Zürich // R Core Team
>
--
*Xu Tian*