[R] Trouble about the interpretation of intercept in lm models
Marc Schwartz
marc_schwartz at comcast.net
Tue Jan 13 18:03:12 CET 2009
on 01/13/2009 09:32 AM Stefano Leonardi wrote:
> Hallo,
> yesterday I was puzzled when I discovered that I
> probabliy miss something in the interepretation of intercept
> in two-way lm models.
>
> I thought that the intercept, using the default contr.treatment
> contrasts, represents the mean of the group of observations
> having zero in all column of the model.matrix.
> It turns out not to be case
>
> To be more more clear I am attaching a short example:
>
> R>#this is to generate the dataset
> R>set.seed(1) #to have the same results I have
> R>B <-rep(c("a","b","c"),c(6,6,6))
> R>nB <- 1:3
> R>names(nB)<-c("a","b","c")
> R>B <- as.factor(B)
> R>A<-rep(rep(c("0","1"),c(3,3)),3)
> R>A <- as.factor(A)
> R>Y <- 5+ 10*as.numeric(as.character(A)) + 20*nB[as.character(B)]
> + rnorm(18,0,5)
> R>#this is the generated dataset
> R>data.frame(Y,A,B)
> Y A B
> 1 21.86773 0 a
> 2 25.91822 0 a
> 3 20.82186 0 a
> 4 42.97640 1 a
> 5 36.64754 1 a
> 6 30.89766 1 a
> 7 47.43715 0 b
> 8 48.69162 0 b
> 9 47.87891 0 b
> 10 53.47306 1 b
> 11 62.55891 1 b
> 12 56.94922 1 b
> 13 61.89380 0 c
> 14 53.92650 0 c
> 15 70.62465 0 c
> 16 74.77533 1 c
> 17 74.91905 1 c
> 18 79.71918 1 c
>
> R>#table with means
> R>M<-tapply(Y,list(A,B),mean)
>
> R>print(M)
> a b c
> 0 22.86927 48.00256 62.14832
> 1 36.84053 57.66039 76.47119
>
> R>lm(Y ~ A + B)
>
> Call:
> lm(formula = Y ~ A + B)
>
> Coefficients:
> (Intercept) A1 Bb Bc
> 23.53 12.65 22.98 39.45
>
>
> I was expecting that the intercept in the lm output would be equal
> to the top left corner (0-a) of my previous M table (22.86927)
> which is the mean of the first three observations in the
> dataset.
>
> So how do I interpret the intercept 23.53?
>
> I could not relate it to any other mean.
>
> R>mean(Y[A=="0" & B=="a"])
> [1] 22.86927
> R>apply(M,1,mean)
> 0 1
> 44.34005 56.99070
> R>apply(M,2,mean)
> a b c
> 29.85490 52.83148 69.30975
>
>
> The following of course gave me the same results as the lm call
>
> R>X<- model.matrix(~A+B)
> R>b <- solve(t(X) %*% X) %*% t(X) %*% Y
> R>b
> [,1]
> (Intercept) 23.52957
> A1 12.65066
> Bb 22.97658
> Bc 39.45485
>
>
> Thank you in advance for any suggestion.
>
> Stefano
You need to look at the means of the fitted data, not the means of the
raw data.
Thus, using 'DF' as the source data frame:
> LM
Call:
lm(formula = Y ~ A + B, data = DF)
Coefficients:
(Intercept) A Bb Bc
23.53 12.65 22.98 39.45
# Get the model fitted Y values
> fitted(LM)
1 2 3 4 5 6 7 8
23.52957 23.52957 23.52957 36.18023 36.18023 36.18023 46.50615 46.50615
9 10 11 12 13 14 15 16
46.50615 59.15681 59.15681 59.15681 62.98442 62.98442 62.98442 75.63508
17 18
75.63508 75.63508
# cbind them
DF.fitted <- cbind(DF, F.lm = fitted(LM))
> DF.fitted
Y A B F.lm
1 21.86773 0 a 23.52957
2 25.91822 0 a 23.52957
3 20.82186 0 a 23.52957
4 42.97640 1 a 36.18023
5 36.64754 1 a 36.18023
6 30.89766 1 a 36.18023
7 47.43715 0 b 46.50615
8 48.69162 0 b 46.50615
9 47.87891 0 b 46.50615
10 53.47306 1 b 59.15681
11 62.55891 1 b 59.15681
12 56.94922 1 b 59.15681
13 61.89380 0 c 62.98442
14 53.92650 0 c 62.98442
15 70.62465 0 c 62.98442
16 74.77533 1 c 75.63508
17 74.91905 1 c 75.63508
18 79.71918 1 c 75.63508
# Now get the means of the fitted values across
# the combinations of A and B
M <- with(DF.fitted, tapply(F.lm, list(A = A, B = B), mean))
> M
B
A a b c
0 23.52957 46.50615 62.98442
1 36.18023 59.15681 75.63508
Thus:
# Intercept = *fitted* mean at A = 0; B = "a"
> M["0", "a"]
[1] 23.52957
# A
> M["1", "a"] - M["0", "a"]
[1] 12.65066
# Bb
> M["1", "b"] - M["1", "a"]
[1] 22.97658
# Bc
> M["1", "c"] - M["1", "a"]
[1] 39.45485
Alternatively, for the intercept:
> predict(LM, newdata = data.frame(A = 0, B = "a"))
1
23.52957
See ?fitted and ?predict.lm
HTH,
Marc Schwartz
More information about the R-help
mailing list