[R] formula question

Wed Mar 18 00:31:25 CET 2009

On 17-Mar-09 23:04:25, Erin Hodgess wrote:
> Dear R People:
> Here is a small data frame and two particular formulas:
>> test.df
>             y  x
> 1  -0.9261650  1
> 2   1.5702700  2
> 3   0.1673920  3
> 4   0.7893085  4
> 5   0.3576875  5
> 6  -1.4620915  6
> 7  -0.5506215  7
> 8  -0.3480292  8
> 9  -1.2344036  9
> 10  0.8502660 10
>> summary(lm(exp(y)~x))
> 
> Call:
> lm(formula = exp(y) ~ x)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max
> -1.6360 -0.6435 -0.4722  0.4215  2.9127
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)   2.1689     0.9782   2.217   0.0574 .
> x            -0.1368     0.1577  -0.868   0.4108
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Residual standard error: 1.432 on 8 degrees of freedom
> Multiple R-squared: 0.08604,    Adjusted R-squared: -0.0282
> F-statistic: 0.7532 on 1 and 8 DF,  p-value: 0.4108
> 
>> summary(lm(I(y^2)~x))
> 
> Call:
> lm(formula = I(y^2) ~ x)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max
> -0.9584 -0.6387 -0.2651  0.5754  1.4412
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  1.10084    0.62428   1.763    0.116
> x           -0.03813    0.10061  -0.379    0.715
> 
> Residual standard error: 0.9138 on 8 degrees of freedom
> Multiple R-squared: 0.01764,    Adjusted R-squared: -0.1052
> F-statistic: 0.1436 on 1 and 8 DF,  p-value: 0.7146
> 
>>
> 
> These both work just fine.
> 
> My question is:  when do you know to use I() and just the function of
> the variable, please?
> 
> thanks in advance,
> Erin
> PS Happy St Pat's Day!

In the case of your formula you will find it works just as well
without I(), because y^2 is on the left-hand side:

 summary(lm(y^2 ~ x))

Call:
lm(formula = y^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9584 -0.6387 -0.2651  0.5754  1.4412 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.10084    0.62428   1.763    0.116
x           -0.03813    0.10061  -0.379    0.715

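The point of I() is that it forces arithmetic evaluation of an
expression which would otherwise be interpreted as a symbolic model
formula. Operators such as +, -, *, /, : and ^ have a special
model-formula meaning on the right-hand side. A function call such
as exp(y) or log(x) has no such meaning and is always evaluated
numerically, so it never needs I().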
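
As a rough illustration of the difference, here is a minimal sketch
using made-up data (the data frame d and the fit_* names are invented
for this example and are not the test.df from the question). It
compares ^ on each side of the formula and shows a plain function
call on the right-hand side:

```r
# Made-up data for illustration only
set.seed(1)
d <- data.frame(x = 1:10)
d$y <- rnorm(10)

# Left-hand side: ^ is ordinary arithmetic, so I() changes nothing
fit_lhs_plain <- lm(y^2 ~ x, data = d)
fit_lhs_I     <- lm(I(y^2) ~ x, data = d)
same_lhs <- isTRUE(all.equal(unname(coef(fit_lhs_plain)),
                             unname(coef(fit_lhs_I))))

# Right-hand side: ^ is a formula operator, so x^2 reduces to x
fit_rhs_plain <- lm(y ~ x^2, data = d)
fit_rhs_I     <- lm(y ~ I(x^2), data = d)
names(coef(fit_rhs_plain))   # "(Intercept)" "x"
names(coef(fit_rhs_I))       # "(Intercept)" "I(x^2)"

# A function call is not a formula operator, so no I() is needed
fit_rhs_log <- lm(y ~ log(x), data = d)
names(coef(fit_rhs_log))     # "(Intercept)" "log(x)"
```

Without I(), the right-hand-side x^2 silently fits the same model as
y ~ x, which is why it is worth checking the coefficient names.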

Thus if X1 and X2 are numeric, and you want to regress Y on the
numerical values of X1*X2, then you should use I(X1*X2), since

  Y ~ X1*X2

would be interpreted as fitting both linear terms and their
interaction (which, for numeric variables, is the product), namely
corresponding to

  Y = a + b1*X1 + b2*X2 + b12*X1*X2

In order to force the fitted equation to be

  Y = a + b*X1*X2

you would use Y ~ I(X1*X2). This issue does not arise when
a product is on the left-hand side of the model formula, since
the response is evaluated arithmetically, so you could simply
use X1*X2 ~ Y.
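
The two fitted equations above can be checked with a small sketch on
simulated data (the data frame d, the true coefficients and the fit_*
names are assumptions made up for this example):

```r
# Simulated data for illustration only
set.seed(42)
d <- data.frame(X1 = rnorm(20), X2 = rnorm(20))
d$Y <- 1 + 2 * d$X1 * d$X2 + rnorm(20, sd = 0.1)

# Formula operator *: main effects plus interaction (four coefficients)
fit_star <- lm(Y ~ X1*X2, data = d)
names(coef(fit_star))   # "(Intercept)" "X1" "X2" "X1:X2"

# I() protects the product: one regressor (two coefficients)
fit_I <- lm(Y ~ I(X1*X2), data = d)
names(coef(fit_I))      # "(Intercept)" "I(X1 * X2)"

# On the left-hand side the product is plain arithmetic either way
fit_lhs_plain <- lm(X1*X2 ~ Y, data = d)
fit_lhs_I     <- lm(I(X1*X2) ~ Y, data = d)
same_lhs <- isTRUE(all.equal(unname(coef(fit_lhs_plain)),
                             unname(coef(fit_lhs_I))))
```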

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 17-Mar-09                                       Time: 23:31:21
------------------------------ XFMail ------------------------------