[R] when to use "I", "as is" caret

David Winsemius dwinsemius at comcast.net
Fri Sep 14 16:47:16 CEST 2012


On Sep 14, 2012, at 12:41 AM, agent dunham wrote:

> Dear community, 
> 
> I've checked this while working, but just to reassure myself: let's say we have
> 2 models: 
> 
> model1 <-  lm(vdep ~ log(v1) + v2 + v3 + I(v4^2) , data = mydata)

If you want to create a second-degree polynomial for "proper" statistical inference via a formula, the way forward is:

?poly
model1 <-  lm(vdep ~ log(v1) + v2 + v3 + poly(v4,2) , data = mydata)

You will get orthogonal polynomials, which differ from most people's naive expectations, but they do allow you to fairly assess departures from linearity.
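
A quick way to see what "orthogonal" means here, as a minimal sketch using the built-in cars data (the names x and P are just for illustration): the columns returned by poly() are uncorrelated with each other and with the intercept, whereas the raw powers are strongly correlated.

```r
# Sketch: inspect the columns that poly() builds for a degree-2 fit.
x <- cars$speed
P <- poly(x, 2)

# Cross-product of the columns: (numerically) the 2 x 2 identity matrix,
# i.e. the linear and quadratic columns are orthonormal.
round(crossprod(P), 10)

# Column sums: (numerically) zero, i.e. each column is also orthogonal
# to the constant (intercept) column.
round(colSums(P), 10)

# By contrast, the raw powers speed and speed^2 are highly correlated,
# which is what inflates the standard errors in a raw-power fit.
cor(x, x^2)
```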

It's interesting to compare two methods with the cars dataset:

Proper use of poly():

> fm <- lm(dist ~ poly(speed, 2), data = cars)
> summary(fm)

Call:
lm(formula = dist ~ poly(speed, 2), data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.720  -9.184  -3.188   4.628  45.152 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       42.980      2.146  20.026  < 2e-16 ***
poly(speed, 2)1  145.552     15.176   9.591 1.21e-12 ***
poly(speed, 2)2   22.996     15.176   1.515    0.136    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 15.18 on 47 degrees of freedom
Multiple R-squared: 0.6673,	Adjusted R-squared: 0.6532 
F-statistic: 47.14 on 2 and 47 DF,  p-value: 5.852e-12 

Improper use of linear and "I-quadratic":

> fm2 <- lm(dist ~ speed+I(speed^2), data = cars)
> summary(fm2)

Call:
lm(formula = dist ~ speed + I(speed^2), data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.720  -9.184  -3.188   4.628  45.152 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.47014   14.81716   0.167    0.868
speed        0.91329    2.03422   0.449    0.656
I(speed^2)   0.09996    0.06597   1.515    0.136

Residual standard error: 15.18 on 47 degrees of freedom
Multiple R-squared: 0.6673,	Adjusted R-squared: 0.6532 
F-statistic: 47.14 on 2 and 47 DF,  p-value: 5.852e-12 
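
Note that the residuals, R-squared and overall F are identical in the two summaries: both formulas span the same model, and only the coefficient table changes. The following sketch (fm1 is an extra linear-only fit introduced here for the comparison, not part of the output above) checks that, and tests the curvature with a nested-model comparison, which does not depend on the parameterization.

```r
# Sketch: the two parameterizations describe the same fitted model.
fm  <- lm(dist ~ poly(speed, 2), data = cars)
fm2 <- lm(dist ~ speed + I(speed^2), data = cars)

# Same fitted values (up to floating-point error).
all.equal(fitted(fm), fitted(fm2))

# The t statistic for the quadratic term agrees in both fits ...
coef(summary(fm))[3, "t value"]
coef(summary(fm2))[3, "t value"]

# ... but the t statistic for the linear term does not. In fm2 the linear
# coefficient is the slope at speed = 0, adjusted for the quadratic term.
coef(summary(fm))[2, "t value"]
coef(summary(fm2))[2, "t value"]

# Nested-model F test for curvature: linear-only fit versus quadratic fit.
fm1 <- lm(dist ~ speed, data = cars)
anova(fm1, fm)
```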

#---------

If you wanted the same results as you would get from I(v4^2) and you were using poly(), it would look like:

(z <- poly(1:10, 2, raw=TRUE)[,2])
 [1]   1   4   9  16  25  36  49  64  81 100

I didn't know offhand whether one could use the raw-poly column within a formula for lm(), but it seems to work as I expected:

> fm <- lm(dist ~ I(speed^2), data = cars)
> fm

Call:
lm(formula = dist ~ I(speed^2), data = cars)

Coefficients:
(Intercept)   I(speed^2)  
      8.860        0.129  

> fm <- lm(dist ~ poly(speed, 2, raw=TRUE)[,2], data = cars)
> fm

Call:
lm(formula = dist ~ poly(speed, 2, raw = TRUE)[, 2], data = cars)

Coefficients:
                    (Intercept)  poly(speed, 2, raw = TRUE)[, 2]  
                          8.860                            0.129  
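
Along the same lines, and only as a sketch (fm_raw and fm_I are names made up for this check), using the whole raw-poly matrix rather than just its second column should reproduce the coefficients of the speed + I(speed^2) fit:

```r
# Sketch: raw polynomial basis versus explicit linear + I(quadratic) terms.
fm_raw <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)
fm_I   <- lm(dist ~ speed + I(speed^2), data = cars)

# The coefficient names differ, so compare the values only.
all.equal(unname(coef(fm_raw)), unname(coef(fm_I)))
```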


(And Uwe's answer covers the rest.)

> model2 <-   lm(vdep ~ log(v1) + v2 + v3 + v4^2, data = mydata)
> 
> So in model1 you really square v4; and in model2, v4^2 doesn't do
> anything, does it? Model2 could be rewritten:
> model2b <-   lm(vdep ~ log(v1) + v2 + v3 + v4, data = mydata) and nothing
> changes, doesn't it?
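
For what it's worth, this is easy to check on the cars data. Inside a formula, ^ is the crossing operator, so speed^2 means "speed crossed with itself", which collapses to just speed. A small sketch (f_pow and f_plain are illustrative names):

```r
# Sketch: without I(), ^ in a formula is term crossing, not arithmetic.
f_pow   <- lm(dist ~ speed^2, data = cars)
f_plain <- lm(dist ~ speed, data = cars)

# Identical coefficients: the "^2" changed nothing.
all.equal(coef(f_pow), coef(f_plain))

# The formula machinery reduces speed^2 to the single term "speed".
attr(terms(dist ~ speed^2), "term.labels")
```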

> 
> This "I" caret is essential with powering or when including transformations
> such as I(1/(v2+v3)) but not with log transformation, isn't it? Is there any
> other transformation where I must also use this "I", as is caret?
> 
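
As a rough guide, the operators that carry special meaning in a formula are the ones documented in ?formula (+, -, *, /, :, ^, %in%). Ordinary function calls such as log(), sqrt() or exp() are evaluated as written, so they do not need I(). One way to see how a formula will be interpreted, without fitting anything, is to look at its term labels. A sketch (labs is a throwaway helper defined here, and y, a, b, c are placeholder variable names):

```r
# Sketch: show how R expands a formula into model terms.
labs <- function(f) attr(terms(f), "term.labels")

labs(y ~ log(a) + sqrt(b))   # function calls are kept as single terms
labs(y ~ (a + b)^2)          # crossing: a, b and the a:b interaction
labs(y ~ a * b)              # also crossing: a, b and a:b
labs(y ~ I(a * b))           # protected: one term, the arithmetic product
labs(y ~ a + I(b + c))       # protected: one term, the arithmetic sum
labs(y ~ I(1/(b + c)))       # protected: one term, the reciprocal of the sum
```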

David Winsemius, MD
Alameda, CA, USA



