Stats 101 : lm with/without intercept

Duncan Murdoch murdoch at stats.uwo.ca
Sat Sep 22 00:00:05 CEST 2007

On 21/09/2007 4:47 PM, Yves Moisan wrote:
> I am puzzled by the use of regression.  I have a categorical variable
> ClassePop33000 which factors a Population variable into 3 levels.  I want to
> investigate whether that categorical variable has some relation with my
> dependent variable, so I ran:
> lm(Cout.ton ~ ClassePop33000, data=ech2)
> Call:
> lm(formula = Cout.ton ~ ClassePop33000, data = ech2)
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -182.24  -62.91  -22.76   66.38  277.39 
> Coefficients:
>                                    Estimate Std. Error t value Pr(>|t|)    
> (Intercept)                          231.66      11.50  20.141  < 2e-16 ***
> ClassePop33000[T.[3000,25000)]       -72.91      16.70  -4.366 2.19e-05 ***
> ClassePop33000[T.[25000,10000000)]   -95.17      19.92  -4.777 3.82e-06 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> Residual standard error: 97.6 on 170 degrees of freedom
> Multiple R-Squared: 0.1502,     Adjusted R-squared: 0.1402 
> F-statistic: 15.02 on 2 and 170 DF,  p-value: 9.818e-07 
> Then I discovered that one can omit the intercept and so get
> coefficients for all N levels of the categorical variable.  So I ran:
> lm(Cout.ton ~ ClassePop33000 + 0, data=ech2)
> Call:
> lm(formula = Cout.ton ~ ClassePop33000 + 0, data = ech2)
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -182.24  -62.91  -22.76   66.38  277.39 
> Coefficients:
>                                Estimate Std. Error t value Pr(>|t|)    
> ClassePop33000[1,3000)           231.66      11.50  20.141  < 2e-16 ***
> ClassePop33000[3000,25000)       158.75      12.11  13.114  < 2e-16 ***
> ClassePop33000[25000,10000000)   136.49      16.27   8.391  1.8e-14 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> Residual standard error: 97.6 on 170 degrees of freedom
> Multiple R-Squared: 0.7922,     Adjusted R-squared: 0.7885 
> F-statistic:   216 on 3 and 170 DF,  p-value: < 2.2e-16 
> I tried the very pedagogical examples at
> http://www.stat.umn.edu/geyer/5102/examp/dummy.html and plotting the
> regression lines with abline gives me exactly the same lines with or
> without the intercept.  So why do the R-squared values differ?  At least
> the p-values are of the same order of magnitude, but I don't understand
> the drastic difference in R-squared.  Any pointers to stats 101?

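The standard definition of R-squared assumes there's an intercept
present: it compares the residual sum of squares with the sum of
squares about the mean of the response.  If you suppress the intercept,
you need a new definition, and summary.lm() then uses the sum of
squares about zero instead.  The two fits have identical residuals but
different baselines, so those R-squared values aren't comparable.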
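
A minimal sketch of the point, on simulated data rather than the ech2
data set (the names grp, y, fit_int and fit_noint are made up for
illustration, and the default treatment contrasts are assumed).  It
fits the same one-factor model both ways, checks that the fitted values
agree, and recomputes each R-squared by hand from the common residual
sum of squares:

```r
# Simulated data (hypothetical; NOT the poster's ech2 data)
set.seed(1)
grp <- factor(rep(c("small", "medium", "large"), each = 30),
              levels = c("small", "medium", "large"))
y <- c(230, 160, 135)[as.integer(grp)] + rnorm(90, sd = 95)

fit_int   <- lm(y ~ grp)       # intercept = baseline mean, others = differences
fit_noint <- lm(y ~ grp + 0)   # no intercept: one mean per level

# Same model, different parameterisation: the fitted values agree
same_fit <- all.equal(fitted(fit_int), fitted(fit_noint))

# Per-level means rebuilt from the intercept fit (treatment contrasts assumed)
b <- unname(coef(fit_int))
means_from_int <- c(b[1], b[1] + b[2], b[1] + b[3])

# One residual sum of squares, shared by both fits
rss <- sum(resid(fit_int)^2)

# With an intercept: compare against the mean-only model
r2_centered <- 1 - rss / sum((y - mean(y))^2)
# Without an intercept: compare against a model that predicts zero
r2_uncentered <- 1 - rss / sum(y^2)

print(same_fit)
print(c(centered   = r2_centered,   reported = summary(fit_int)$r.squared))
print(c(uncentered = r2_uncentered, reported = summary(fit_noint)$r.squared))
```

The same relation is visible in the quoted output: 231.66 - 72.91 =
158.75 and 231.66 - 95.17 = 136.49, so the two sets of coefficients
describe the same three group means.  Only the denominator of
R-squared changes, and since the response is far from zero on average,
the sum of squares about zero is much larger than the one about the
mean.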

Duncan Murdoch

