[R] lm without intercept
Jay Emerson
jayemerson at gmail.com
Fri Feb 18 14:02:16 CET 2011
No, but this is a cute problem: the definition of R^2 changes without
the intercept, because the "empty" model used for calculating the total
sum of squares always predicts 0 (so the total sum of squares is the
sum of squares of the observations themselves, without centering around
the sample mean).
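To see the difference concretely, here is a minimal sketch with
made-up data (x, y, and the coefficients here are hypothetical, not
from the original post) showing how summary.lm computes R^2 in each
case:

set.seed(1)
x <- runif(20, 100, 200)
y <- 300 + 0.3 * x + rnorm(20, sd = 50)
fit  <- lm(y ~ x)        # with intercept
fit0 <- lm(y ~ x - 1)    # intercept forced through the origin

## With an intercept, the total SS is centered at mean(y):
1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)   # summary(fit)$r.squared

## Without one, the "empty" model predicts 0, so there is no centering:
1 - sum(residuals(fit0)^2) / sum(y^2)              # summary(fit0)$r.squared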
Your interpretation of the p-value for the intercept in the first
model is also backwards: 0.9535 is extremely weak evidence against the
hypothesis that the intercept is 0. That is, the intercept might be
near zero, but it could also be something very different. With a
standard error of 229, your 95% confidence interval for the intercept
(if you trusted it based on other things) would have a margin of error
of well over 400. If you told me that an intercept of, say, 350 or 400
were consistent with your knowledge of the problem, I wouldn't blink.
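For instance, plugging the numbers from the summary below into R (the
first model has 20 residual degrees of freedom):

## 95% margin of error = t quantile times the standard error
qt(0.975, df = 20) * 229.0764   # about 477.9

so the interval would run from roughly -464 to 491.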
This is a very small data set: if you sent an R command such as:
x <- c(x1, x2, ..., xn)
y <- c(y1, y2, ..., yn)
you might even get some more interesting feedback (one way to generate
such a command is sketched below). One of the many good intro stats
textbooks might also be helpful as you get up to speed.
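As an aside (not something the original exchange mentions): dput()
prints R code that reconstructs an object exactly, ready to paste into
a post:

## dput() is a convenient way to share a small data set verbatim
x <- c(1.2, 3.4, 5.6)
dput(x)   # prints: c(1.2, 3.4, 5.6)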
Jay
---------------------------------------------
Original post:
Message: 135
Date: Fri, 18 Feb 2011 11:49:41 +0100
From: Jan <jrheinlaender at gmx.de>
To: "R-help at r-project.org list" <r-help at r-project.org>
Subject: [R] lm without intercept
Message-ID: <1298026181.2847.19.camel at jan-laptop>
Content-Type: text/plain; charset="UTF-8"
Hi,
I am not a statistics expert, so I have a question. A linear model
gives me the following summary:
Call:
lm(formula = N ~ N_alt)
Residuals:
    Min      1Q  Median      3Q     Max
-110.30  -35.80  -22.77   38.07  122.76
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.5177   229.0764   0.059   0.9535
N_alt         0.2832     0.1501   1.886   0.0739 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 56.77 on 20 degrees of freedom
(16 observations deleted due to missingness)
Multiple R-squared: 0.151, Adjusted R-squared: 0.1086
F-statistic: 3.558 on 1 and 20 DF, p-value: 0.07386
The regression is not very good (high p-value, low R-squared).
The Pr value for the intercept seems to indicate that it is zero with a
very high probability (95.35%). So I repeat the regression forcing the
intercept to zero:
Call:
lm(formula = N ~ N_alt - 1)
Residuals:
    Min      1Q  Median      3Q     Max
-110.11  -36.35  -22.13   38.59  123.23
Coefficients:
      Estimate Std. Error t value Pr(>|t|)
N_alt 0.292046   0.007742   37.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.41 on 21 degrees of freedom
(16 observations deleted due to missingness)
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9848
F-statistic: 1423 on 1 and 21 DF, p-value: < 2.2e-16
1. Is my interpretation correct?
2. Is it possible that just by forcing the intercept to become zero, a
bad regression becomes an extremely good one?
3. Why doesn't lm suggest a value of zero (or near zero) by itself if
the regression is so much better with it?
Please excuse my ignorance.
Jan Rheinländer
--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay