[R] R2 always increases as variables are added?

Thu May 24 21:37:11 CEST 2007

The answer to your question three is that the calculation of r-squared 
in summary.lm does depend on whether or not an intercept is included in 
the model.  (Another part of the reason for you puzzlement is, I think, 
that you are computing R-squared as SSR/SST, which is only valid when 
when the model has an intercept).

The code is in summary.lm, here are the relevant excerpts (assuming your 
model does not have weights):

     r <- z$residuals
     f <- z$fitted
     w <- z$weights
     if (is.null(w)) {
         mss <- if (attr(z$terms, "intercept"))
             sum((f - mean(f))^2)
         else sum(f^2)
         rss <- sum(r^2)
     }
...
     ans$r.squared <- mss/(mss + rss)

If you want to compare models with and without an intercept based on 
R^2, then I suspect it's most appropriate to use the version of R^2 that 
does not use a mean.

It's also worthwhile thinking about what you are actually doing.  I find 
the most intuitive definition of R^2 
(http://en.wikipedia.org/wiki/R_squared) is

R2 = 1 - SSE / SST

where SSE = sum_i (yhat_i - y_i)^2, (sum of errors in predictions for 
you model)
and SST = sum_i (y_i - mean(y))^2 (sum of errors in predictions for an 
intercept-only model)

This means that the standard definition of R2 effectively compares the 
model with an intercept-only model.  As the error in predictions goes 
down, R2 goes up, and the model that uses the mean(y) as a prediction 
(i.e., the intercept-only model) provides a scale for these errors.

If you think or know that the true mean of y is zero then it may be 
appropriate to compare against a zero model rather than an 
intercept-only model (in SST).  And if the sample mean of y is quite 
different from zero, and you compare a no-intercept model against an 
intercept-only model, then you're going to get results that are not 
easily interpreted.

Note that a common way of expressing and computing R^2 is as SSR/SST 
(which you used).  (Where SSR = sum_i (yhat_i - mean(y))^2 ). However, 
this is only valid when the model has an intercept (i.e., SSR/SST = 1 - 
SSE/SST ONLY when the model has an intercept.)

Here's some examples, based on your example:

 > set.seed(1)
 > data <- data.frame(x1=rnorm(10), x2=rnorm(10), y=rnorm(10), I=1)
 >
 > lm1 <- lm(y~1, data=data)
 > summary(lm1)$r.squared
[1] 0
 > y.hat <- fitted(lm1)
 > sum((y.hat-mean(data$y))^2)/sum((data$y-mean(data$y))^2)
[1] 5.717795e-33
 >
 > # model with no intercept
 > lm2 <- lm(y~x1+x2-1, data=data)
 > summary(lm2)$r.squared
[1] 0.6332317
 > y.hat <- fitted(lm2)
 > # no-intercept version of R^2 (2 ways to compute)
 > 1-sum((y.hat-data$y)^2)/sum((data$y)^2)
[1] 0.6332317
 > sum((y.hat)^2)/sum((data$y)^2)
[1] 0.6332317
 > # standard (assuming model has intercept) computations for R^2:
 > SSE <- sum((y.hat - data$y)^2)
 > SST <- sum((data$y - mean(data$y))^2)
 > SSR <- sum((y.hat - mean(data$y))^2)
 > 1 - SSE/SST
[1] 0.6252577
 > # Note that SSR/SST != 1 - SSE/SST (because the model doesn't have an 
intercept)
 > SSR/SST
[1] 0.6616612
 >
 > # model with intercept included in data
 > lm3 <- lm(y~x1+x2+I-1, data=data)
 > summary(lm3)$r.squared
[1] 0.6503186
 > y.hat <- fitted(lm3)
 > # no-intercept version of R^2 (2 ways to compute)
 > 1-sum((y.hat-data$y)^2)/sum((data$y)^2)
[1] 0.6503186
 > sum((y.hat)^2)/sum((data$y)^2)
[1] 0.6503186
 > # standard (assuming model has intercept) computations for R^2:
 > SSE <- sum((y.hat - data$y)^2)
 > SST <- sum((data$y - mean(data$y))^2)
 > SSR <- sum((y.hat - mean(data$y))^2)
 > 1 - SSE/SST
[1] 0.6427161
 > SSR/SST
[1] 0.6427161
 >
 >

hope this helps,

Tony Plate

Disclaimer: I too do not have any degrees in statistics, but I'm 95% 
sure the above is mostly correct :-)  If there are any major mistakes, 
I'm sure someone will point them out.

??? wrote:
> Hi, everybody,
> 
> 3 questions about R-square:
> ---------(1)----------- Does R2 always increase as variables are added?
> ---------(2)----------- Does R2 always greater than 1?
> ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> calculated? It is different from (r.square=sum((y.hat-mean
> (y))^2)/sum((y-mean(y))^2))
> 
> I will illustrate these problems by the following codes:
> ---------(1)-----------  R2  doesn't always increase as variables are added
> 
>> x=matrix(rnorm(20),ncol=2)
>> y=rnorm(10)
>>
>> lm=lm(y~1)
>> y.hat=rep(1*lm$coefficients,length(y))
>> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> [1] 2.646815e-33
>> lm=lm(y~x-1)
>> y.hat=x%*%lm$coefficients
>> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> [1] 0.4443356
>> ################ This is the biggest model, but its R2 is not the biggest,
> why?
>> lm=lm(y~x)
>> y.hat=cbind(rep(1,length(y)),x)%*%lm$coefficients
>> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> [1] 0.2704789
> 
> 
> ---------(2)-----------  R2  can greater than 1
> 
>> x=rnorm(10)
>> y=runif(10)
>> lm=lm(y~x-1)
>> y.hat=x*lm$coefficients
>> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> [1] 3.513865
> 
> 
>  ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> calculated? It is different from (r.square=sum((y.hat-mean
> (y))^2)/sum((y-mean(y))^2))
>> x=matrix(rnorm(20),ncol=2)
>> xx=cbind(rep(1,10),x)
>> y=x%*%c(1,2)+rnorm(10)
>> ### r2 calculated by lm(y~x)
>> lm=lm(y~x)
>> summary(lm)$r.squared
> [1] 0.9231062
>> ### r2 calculated by lm(y~xx-1)
>> lm=lm(y~xx-1)
>> summary(lm)$r.squared
> [1] 0.9365253
>> ### r2 calculated by me
>> y.hat=xx%*%lm$coefficients
>> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> [1] 0.9231062
> 
> 
> Thanks a lot for any cue:)
> 
> 
> 
>