[R] Seemingly simple "lm" giving unexpected results
peter dalgaard
pdalgd at gmail.com
Sat Apr 14 21:45:46 CEST 2012
On Apr 14, 2012, at 14:40 , Berend Hasselman wrote:
>
> On 13-04-2012, at 22:20, Gene Leynes wrote:
>
>> I can't figure out why this is returning an NA for the slope in one case,
>> but not in the other.
>>
>> I can tell that R thinks the first case is singular, but why isn't the
>> second?
>>
>> ## Define X and Y
>> ## There are two versions of x
>> ## 1) "as is"
>> ## 2) shifted to start at 0
>> y = c(58, 57, 57, 279, 252, 851, 45, 87, 47)
>> x1 = c(1334009411.437, 1334009411.437, 1334009411.437, 1334009469.297,
>> 1334009469.297, 1334009469.297, 1334009485.697, 1334009485.697,
>> 1334009485.697)
>> x2 = x1 - min(x1)
>>
>> ## Why doesn't the LM model work for the "as is" x?
>> lm(y~x1)
>> lm(y~x2)
>
>
> With your data the matrix t(X)%*%X is extremely ill-conditioned for the case with x1.
> See
>
> http://en.wikipedia.org/wiki/Condition_number
> http://mathworld.wolfram.com/ConditionNumber.html
> http://en.wikipedia.org/wiki/Numerical_methods_for_linear_least_squares
>
> You can check it with
>
> makeXX <- function(x) {
> matrix(data=c(x,rep(1,length(x))),nrow=length(x), byrow=FALSE)
> }
>
> X1 <- makeXX(x1)
> (XTX1 <- t(X1) %*% X1)
> svd(XTX1)
>
> and similar for x2.
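Berend's check can also be condensed with base R's kappa(), which estimates the 2-norm condition number of a matrix directly; a minimal sketch using the data from the original post:

```r
y  <- c(58, 57, 57, 279, 252, 851, 45, 87, 47)
x1 <- c(1334009411.437, 1334009411.437, 1334009411.437, 1334009469.297,
        1334009469.297, 1334009469.297, 1334009485.697, 1334009485.697,
        1334009485.697)
x2 <- x1 - min(x1)

## Condition number of the design matrix itself (not X'X):
kappa(cbind(1, x1))  # enormous: the two columns differ in scale by ~1e9
kappa(cbind(1, x2))  # modest: the shifted regressor is well conditioned
```

Note that squaring up to X'X roughly squares the condition number, which is why the normal-equations route fails first.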
<lecture on numerical linear algebra>
lm() is actually a bit smarter than to use the textbook formula (X'X)^{-1}X'Y; it is internally based on a QR decomposition, which is numerically far more stable. What it effectively does is orthogonalize the columns of X successively.
> x <- cbind(1, x1)
> qr(x)
$qr
x1
[1,] -3.0000000 -4.002028e+09
[2,] 0.3333333 9.555777e+01
[3,] 0.3333333 3.456548e-01
[4,] 0.3333333 -2.598428e-01
[5,] 0.3333333 -2.598428e-01
[6,] 0.3333333 -2.598428e-01
[7,] 0.3333333 -4.314668e-01
[8,] 0.3333333 -4.314668e-01
[9,] 0.3333333 -4.314668e-01
$rank
[1] 1
$qraux
[1] 1.333333 1.345655
$pivot
[1] 1 2
attr(,"class")
[1] "qr"
However, notice the rank of 1. That's because it wants to protect the user against unwittingly plugging in a singular design matrix, so when the algorithm encounters a column that looks like it has been reduced to zero after orthogonalization, it basically throws it out.
You can defeat this mechanism by setting the tol= argument of qr() sufficiently low. In fact, you can do the same with lm() itself:
> lm(y ~ x1, tol = 0)
...
(Intercept) x1
-2.474e+09 1.854e+00
Notice that the slope is consistent with the one from
> lm(y ~ x2)
...
(Intercept) x2
110.884 1.854
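For everyday use, the robust fix is not to lower tol but to shift or center the regressor before fitting, exactly as the x2 version does; the slope is unchanged by an affine shift of x. A sketch (I() is used so the shift happens inside the formula):

```r
y  <- c(58, 57, 57, 279, 252, 851, 45, 87, 47)
x1 <- c(1334009411.437, 1334009411.437, 1334009411.437, 1334009469.297,
        1334009469.297, 1334009469.297, 1334009485.697, 1334009485.697,
        1334009485.697)

## Centering x1 gives a well-conditioned fit; the slope matches the
## x2 fit (~1.854), only the intercept changes.
fit <- lm(y ~ I(x1 - mean(x1)))
coef(fit)
```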
In contrast, if you try the textbook way:
> solve(crossprod(x), crossprod(x,y))
Error in solve.default(crossprod(x), crossprod(x, y)) :
system is computationally singular: reciprocal condition number = 2.67813e-34
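For scale: solve()'s default tolerance is on the order of machine epsilon, so a reciprocal condition number of 2.7e-34 is hopelessly below the cutoff:

```r
## Machine epsilon for double precision, roughly solve()'s default tol:
.Machine$double.eps           # about 2.2e-16
2.67813e-34 < .Machine$double.eps  # TRUE: hence "computationally singular"
```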
Ah, well, let's try and defeat that:
> solve(crossprod(x), crossprod(x,y), tol=0)
[,1]
-2.959385e+09
x1 2.218414e+00
Which, as you'll notice, is, er, somewhat off the mark.
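For completeness, the QR route can be driven directly with base R's qr.solve(), which should reproduce the lm(y ~ x1, tol = 0) coefficients rather than the garbage from the normal equations:

```r
y  <- c(58, 57, 57, 279, 252, 851, 45, 87, 47)
x1 <- c(1334009411.437, 1334009411.437, 1334009411.437, 1334009469.297,
        1334009469.297, 1334009469.297, 1334009485.697, 1334009485.697,
        1334009485.697)

## QR-based least squares, with the rank check disabled via tol = 0:
qr.solve(cbind(1, x1), y, tol = 0)  # slope ~ 1.854
```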
</lecture on numerical linear algebra>
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com