[R] lm -- significance of x coefficient when I(x^2) is used

Bill Venables Bill.Venables at cmis.csiro.au
Tue Sep 26 14:33:51 CEST 2000

At 10:12 26/09/00 +0200, Michael Pronath wrote:
> In "Modern Applied Statistics with S-Plus" 3rd ed., footnote on page 153
> regarding a model lm(Gas~Insul/(Temp+I(Temp^2))-1,whiteside), I read
>     "Notice that when the quadratic terms are present, first degree
>      coefficients mean 'the slope of the curve at temperature zero', so a
>      non-significant value does not mean that the linear term is not
>      needed. Removing the non-significant linear term for the 'after'
>      group, for example, would be unjustified."

I accept full responsibility...  and I am sticking to my guns.  I think this
is a crucial point of inference often misunderstood.  It is at the core of
the reason why putting significance stars routinely on t-statistics is NOT
a good idea - by doing so you encourage confusion and unjustified inferences.

> AFAIK, t-test for significance of a coefficient is not based on the
> assumption that the variables of the linear model are "independent".  

Quite correct, it is not.

> What
> if I only got the model matrix X and I don't know, that one column is
> simply the square of another: Do I have to examine the model matrix for
> polynomial dependencies between its columns, to know if t-test significance
> is "significant"?

Let me start to answer by asking you "What does a `significant' t-test
result mean?"   To me it means that if someone were to pose the null
hypothesis that the mean (or in this case regression coefficient) were
zero, you would have, by convention, strong enough evidence to reject it.
If the result were `non-significant', by contrast, it does NOT allow you to
assert that the regression coefficient IS zero, it only means that you do
not have enough EVIDENCE to reject such a claim.  The real question is
whether or not it is a claim anyone would have good reason to make in the
first place - sometimes it would be, sometimes not.  In the case above I
would say from the context there is no good reason for anyone a priori to
claim that the derivative at temperature 0 should be zero, that is, that
the curve should necessarily be flat at that rather arbitrary point.  

> If |t| is small for the 'slope of the curve at temperature zero', doesn't
> that just mean that 'slope of the curve at temperature zero' is not
> significantly different from 0 

Yes it does, but it is most likely not significantly different from 0 over a
range of temperature values near 0 degC as well.  So where should you
constrain the curve to be flat?
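
A minimal sketch of this point in R.  It uses simulated data (an assumption,
not the whiteside data from the book), and the names temp, gas and slope_t
are made up for illustration.  The slope of a fitted quadratic at temperature
t0 is b1 + 2*b2*t0, so the same kind of t-statistic can be computed at any
temperature, not only at zero:

```r
# Simulated data: an assumption for illustration, NOT the whiteside data.
set.seed(1)
temp <- runif(60, -1, 10)
gas  <- 5 - 0.1 * temp - 0.02 * temp^2 + rnorm(60, sd = 0.3)

fit <- lm(gas ~ temp + I(temp^2))
b   <- coef(fit)   # (intercept, b1, b2)
V   <- vcov(fit)   # estimated covariance matrix of the coefficients

# The slope of the fitted curve at t0 is b1 + 2 * b2 * t0, a linear
# combination of the coefficients with weights (0, 1, 2 * t0).
# slope_t() returns that estimate divided by its standard error.
slope_t <- function(t0) {
  w   <- c(0, 1, 2 * t0)
  est <- sum(w * b)
  se  <- sqrt(drop(t(w) %*% V %*% w))
  est / se
}

# t-statistics for "is the curve flat HERE?" at several temperatures
t0 <- c(-2, -1, 0, 1, 2)
data.frame(t0 = t0, t_value = sapply(t0, slope_t))
```

At t0 = 0 this reproduces the t value that summary(fit) reports for the
linear term; the other rows are the same test posed at other temperatures.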

It comes back to the question of where would someone have good reason a
priori to pose the question "Is the curve flat HERE, at this very special
temperature?"  If there is no good reason to pose the question, why force
the model to conform to this arbitrary restriction?  Notice that this is
not quite the same thing as variable selection where Occam's principle is
the good reason for considering whether or not coefficients are zero.  

> and that I had better set it to 0, i.e. omit
> the linear term?

Ahh ... and you were doing so well, too!  Shocking as it may sound,
non-significance, by itself, is not a good enough reason to omit terms in a
regression (provided, of course, you had a good reason for including them in
the first place).

> My only explanation for this is, that R somehow "detects" polynomial
> expressions in model formulae and treats them specially.

No, it doesn't, but it would be nice if it could.  The same sort of
consideration comes into play when you have factor models with
interactions: no main effect term is removed while a higher-way interaction
involving it is still present in the model.  This is exactly the same
principle at work.

> Could anybody tell me a bit more on this subject?

Only that it is often called "the marginality principle", it has caused
endless, heated and ultimately futile debates in the past, and that when
you finally get to see why it makes sense to think this way you immediately
see the exceptions and you start to get a deeper understanding of what
significance tests and model selection are all about.  

What strikes me as the crucial question in model selection problems like
this is "what group of transformations of the regressor variables should
the model selection process be invariant with respect to?"  For simple
polynomial regressions it is often (but certainly not always) reasonable to
require the model selection process to be invariant with respect to changes
of origin and scale in the predictor.  This immediately tells you that you
should at every stage be considering the leading (highest degree)
coefficient only and not those lower down, since by a change of origin and
scale you leave the highest degree coefficient t-statistic invariant, but
you can make any of the other t-statistics just about any value you damn
well like, and certainly zero.  Similarly with spatial regressions it may
be reasonable to require that the selection process be invariant with
respect to affine transformations of latitude and longitude, and, of
course, sometimes not.
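
The invariance argument for polynomial regression can be checked numerically.
The following is a rough sketch in R on simulated data (an assumption; x, y
and the helper t_values are hypothetical names, not from the original
message).  It refits the same quadratic after changing the origin and scale
of the predictor and collects the t-statistics each time:

```r
# Simulated data: an assumption for illustration only.
set.seed(2)
x <- runif(50, 0, 10)
y <- 1 + 0.5 * x - 0.08 * x^2 + rnorm(50, sd = 0.5)

# Refit the quadratic with the predictor re-expressed as
# z = (x - origin) / scale and return the three t-statistics.
t_values <- function(origin, scale = 1) {
  z <- (x - origin) / scale
  summary(lm(y ~ z + I(z^2)))$coefficients[, "t value"]
}

# Origin at which the fitted curve is exactly flat: -b1 / (2 * b2),
# computed from the fit on the original scale.
b      <- coef(lm(y ~ x + I(x^2)))
vertex <- unname(-b[2] / (2 * b[3]))

rbind(original  = t_values(0),
      shifted   = t_values(3),
      rescaled  = t_values(3, 2.5),
      at_vertex = t_values(vertex))
```

The I(z^2) column should be the same in every row, while the z column changes
with the choice of origin and is zero (up to rounding) in the last row, where
the origin has been placed at the turning point of the fitted curve.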

Think about it and then, usually, I would say think some more...

Bill Venables.

> Michael Pronath
>r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
>Send "info", "help", or "[un]subscribe"
>(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
Bill Venables, Statistician                         Tel. +61 7 3826 7251 
CSIRO Marine Laboratories,                          Fax. +61 7 3826 7304
Cleveland, Qld, 4163                  Email: Bill.Venables at cmis.csiro.au
AUSTRALIA                        http://www.cmis.csiro.au/bill.venables/
