[R] R-squared value for linear regression passing through origin using lm()

Fri Oct 19 14:59:49 CEST 2007

Berwin A Turlach, Friday, 19 October 2007:
> G'day Ralf,

Hi Berwin,

> On Fri, 19 Oct 2007 09:51:37 +0200 Ralf Goertz <R_Goertz at web.de>
> wrote:
> 
> Why should either of those formulas yield the output of
> summary(lm(y~x+0)) ?  The R-squared output of that command is
> documented in help(summary.lm):
> 
> r.squared: R^2, the 'fraction of variance explained by the model',
> 
>               R^2 = 1 - Sum(R[i]^2) / Sum((y[i]- y*)^2),

Yes, I know. But you know why I chose those formulas, right?

>           where y* is the mean of y[i] if there is an intercept and
>           zero otherwise.
> 
> And, indeed:
> 
> > 1-sum(residuals(lm(y~x+0))^2)/sum((y-0)^2)
> [1] 0.9796238
> 
> confirms this.
> 
> Note: if you do not have an intercept in your model, the residuals do
> not have to add to zero; and, typically, they will not.  Hence,
> var(residuals(lm(y~x+0))) does not give you the residual sum of squares.

Yes I am right, you know why.
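
For anyone following along, here is a minimal sketch of the two points
quoted above: that summary.lm() uses y* = 0 for a fit through the
origin, and that the residuals of such a fit need not sum to zero. The
data and the seed are arbitrary assumptions of mine, not the data from
earlier in the thread, and the variable names are illustrative only.

```r
# Simulated data (an assumption for illustration; not the thread's original y and x)
set.seed(1)
x <- runif(100) * 2 + 10
y <- 4 + rnorm(100)

fit0 <- lm(y ~ x + 0)            # regression through the origin
res  <- residuals(fit0)
n    <- length(y)

rss        <- sum(res^2)         # residual sum of squares
sum_res    <- sum(res)           # generally not zero without an intercept
centred_ss <- var(res) * (n - 1) # sum((res - mean(res))^2), which is not the RSS here

# R^2 as documented in help(summary.lm), with y* = 0 for a no-intercept model
r2_manual  <- 1 - rss / sum((y - 0)^2)
r2_summary <- summary(fit0)$r.squared

print(c(sum_res = sum_res, rss = rss, centred_ss = centred_ss))
print(c(manual = r2_manual, summary = r2_summary))
```

The two R^2 values should agree up to floating-point error, while
centred_ss falls short of rss by n * mean(res)^2.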

> > In order to save the role of R^2 as a goodness-of-fit indicator 
> 
> R^2 is no goodness-of-fit indicator, neither in models with intercept
> nor in models without intercept.  So I do not see how you can save its
> role as a goodness-of-fit indicator. :)

Okay, I surrender.

> Since you are posting from a .de domain, I assume you will understand
> the following quote from Tutz (2000), "Die Analyse kategorialer Daten",
> page 18:
> 
> R^2 does *not* measure the goodness of fit of the linear model; it says
> nothing about whether the linear approach is true or false, but only
> whether individual observations can be predicted by the linear
> approach.  R^2 is substantially determined by the design, i.e. the
> values that x takes (cf. Kockelkorn (1998)).

Thank you very much.
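
The remark that R^2 depends heavily on the design can be illustrated
with a small simulation. This is only a sketch under assumptions of my
own: the true model (intercept 1, slope 2, standard normal errors), the
two ranges of x, and the seed are all arbitrary choices and appear
nowhere in the quoted text. The same correct linear model is fitted
twice, and only the spread of x differs.

```r
# Same true linear model and same errors; only the range of x changes.
set.seed(42)                      # arbitrary seed
n <- 200
e <- rnorm(n)                     # shared error term

x_narrow <- runif(n, 0, 1)        # x confined to a short interval
x_wide   <- runif(n, 0, 10)       # x spread over a long interval

y_narrow <- 1 + 2 * x_narrow + e
y_wide   <- 1 + 2 * x_wide   + e

r2_narrow <- summary(lm(y_narrow ~ x_narrow))$r.squared
r2_wide   <- summary(lm(y_wide   ~ x_wide))$r.squared

print(c(narrow = r2_narrow, wide = r2_wide))
```

In both cases the linear model is exactly right, yet the wider design
should give a much larger R^2, which is the sense in which R^2 says
little about whether the linear approach is "true".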

> > But I assume that this has probably been discussed at length
> > somewhere more appropriate than r-help.
> 
> I am sure about that, but it was also discussed here on r-help (long
> ago).  The problem is that this compares two models that are not nested
> in each other, which is quite a controversial thing to do; some might
> even go so far as to say that it makes no sense at all.  The other
> problem with this approach is illustrated by my example:
> 
> > set.seed(20070807)
> > x <- runif(100)*2+10
> > y <- 4+rnorm(x, sd=1)
> > 1-var(residuals(lm(y~x+0)))/var(y)
> [1] -0.04848273
> 
> How do you explain that a quantity that is called R-squared, implying
> that it is the square of something, hence always non-negative, can
> become negative?

because the correlation coefficient is either 0.2201879424i or
-0.2201879424i ;)
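
For completeness, a short sketch that reruns the quoted example and
puts the variance-based quantity next to the R^2 that summary.lm()
reports for the same fit. The code up to r2_var is taken from the
quoted message; the variable names and the complex square root are my
additions. The quoted message reports -0.04848273 for r2_var, though I
would not rely on that exact figure being reproduced on every R version.

```r
# Example from the quoted message (seed and data-generating code as given there)
set.seed(20070807)
x <- runif(100) * 2 + 10
y <- 4 + rnorm(x, sd = 1)

fit0 <- lm(y ~ x + 0)

# Variance-based "R-squared": can be negative for a no-intercept fit
r2_var <- 1 - var(residuals(fit0)) / var(y)

# R^2 as defined in help(summary.lm) with y* = 0: always within [0, 1]
r2_doc <- summary(fit0)$r.squared

# Principal square root in the complex plane; imaginary when r2_var < 0
root <- sqrt(as.complex(r2_var))

print(c(r2_var = r2_var, r2_doc = r2_doc))
print(root)
```

The documented R^2 cannot leave [0, 1] because, for least squares
through the origin, sum(y^2) splits exactly into sum(fitted^2) plus
sum(residuals^2).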

Thanks for your time, and yours as well, Steve. You've been very
helpful.

Ralf