[R] OLS variables

Mon Nov 7 10:05:15 CET 2005

On Sun, 6 Nov 2005, Kjetil Brinchmann Halvorsen wrote:

> John Fox wrote:
>>
>> I assume that you're using lm() to fit the model, and that you don't really
>> want *all* of the interactions among 20 predictors: You'd need quite a lot
>> of data to fit a model with 2^20 terms in it, and might have trouble
>> interpreting the results.
>>
>> If you know which interactions you're looking for, then why not specify them
>> directly, as in lm(y ~  x1*x2 + x3*x4*x5 + etc.)? On the other hand, if you
>> want to include all interactions, say, up to three-way, and you've put the
>> variables in a data frame, then lm(y ~ .^3, data=DataFrame) will do it.
>
> This is nice with factors, but with continuous variables, where a
> response-surface type of model is needed, it will not do. For instance,
> with variables x, y, z in data frame dat
>    lm( y ~ (x+z)^2, data=dat )
> gives a model with the terms x, z and x*z, not the square terms.
> There is a need for a semi-automatic way to get these, for instance,
> use poly() or polym() as in:
>
> lm(y ~ polym(x,z,degree=2), data=dat)

This is an R-S difference (FAQ 3.3.2).  R's formula parser always takes 
x^2 = x whereas the S one does so only for factors.  This makes sense if 
you interpret `interaction' strictly as in John's description - S chose 
to see an interaction of any two continuous variables as multiplication
(something which puzzled me when I first encountered it, as it was not 
well documented back in 1991).
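
To make the R behaviour concrete, here is a minimal sketch on simulated data. The data frame dat and the variables x, z and y are made up for illustration, and the I() workaround is one common way to get square terms rather than anything proposed in this thread.

```r
# Simulated data; dat, x, z and y are illustrative names only
set.seed(1)
dat <- data.frame(x = rnorm(50), z = rnorm(50))
dat$y <- 1 + dat$x + dat$z + dat$x^2 + dat$x * dat$z + rnorm(50)

# In an R formula, ^ means crossing up to the given order, so x^2 reduces to x
labs_cross <- attr(terms(y ~ (x + z)^2), "term.labels")
print(labs_cross)   # "x" "z" "x:z"

# I() protects ordinary arithmetic, which gives genuine square terms
fit_quad <- lm(y ~ (x + z)^2 + I(x^2) + I(z^2), data = dat)

# polym() builds a full degree-2 polynomial basis (orthogonal by default)
fit_poly <- lm(y ~ polym(x, z, degree = 2), data = dat)

# Compare the two parameterisations through their fitted values
print(all.equal(fitted(fit_quad), fitted(fit_poly)))
```

The two fits use different bases, so the coefficients differ, but both bases should span the same quadratic response surface.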

I have often wondered if this difference was thought to be an improvement, 
or if it is just a different implementation of the Wilkinson-Rogers syntax.
Should we consider changing it?

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595