[R] Error in lm() with very small (close to zero) regressor

Sat Mar 28 18:28:22 CET 2015

peter dalgaard <pdalgd <at> gmail.com> writes:

> 
> 
> > On 28 Mar 2015, at 00:32 , RiGui <raluca.gui <at> business.uzh.ch> wrote:
> > 
> > Hello everybody,
> > 
> > I have encountered the following problem with lm():
> > 
> > When running lm() with a regressor close to zero - 
> of the order e-10, the
> > value of the estimate is of huge absolute value , of order millions. 
> > 
> > However, if I write the formula of the OLS estimator, 
> in matrix notation:
> > pseudoinverse(t(X)*X) * t(X) * y , the results are correct, meaning the
> > estimate has value 0.

  How do you know this answer is "correct"?

> > here is the code:
> > 
> > y  <- rnorm(n_obs, 10,2.89)
> > x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
> > x2 <- rnorm(n_obs, 10,3.21)
> > X  <- cbind(x1,x2)
> > 
> > bFE <- lm(y ~ x1 + x2)
> > bFE
> > 
> > bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
> > bOLS
> > 
> > 
> > Note: I am applying a deviation from the 
> mean projection to the data, that
> > is why I have some regressors with such small values.
> > 
> > Thank you for any help!
> > 
> > Raluca

  Is there a reason you can't scale your regressors?

> > 
> Example not reproducible:
> 

  I agree that the OP's question was not reproducible, but it's
not too hard to make it reproducible. I bothered to use
library("sos"); findFn("pseudoinverse") to find pseudoinverse()
in corpcor:

It is true that we get estimates with very large magnitudes,
but their 

set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10,2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14,sd=4.5e-17)
x2 <- rnorm(n_obs, 10,3.21)
X  <- cbind(x1,x2)

bFE <- lm(y ~ x1 + x2)
bFE
coef(summary(bFE))

                 Estimate   Std. Error     t value  Pr(>|t|)
(Intercept)  1.155959e+01 2.312956e+01  0.49977541 0.6173435
x1          -1.658420e+14 1.872598e+15 -0.08856254 0.9294474
x2           3.797342e-02 2.813000e-02  1.34992593 0.1773461

library("corpcor")
bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
bOLS

             [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01

And if we scale the predictor:

bFE2 <- lm(y ~ I(1e14*x1) + x2)
coef(summary(bFE2))

                 Estimate Std. Error     t value  Pr(>|t|)
(Intercept)   11.55958731   23.12956  0.49977541 0.6173435
I(1e+14 * x1) -1.65842037   18.72598 -0.08856254 0.9294474
x2             0.03797342    0.02813  1.34992593 0.1773461

bOLS stays constant.

To be honest, I haven't thought about this enough to see
which answer is actually correct, although I suspect the
problem is in bOLS, since the numerical methods (unlike
the brute-force pseudoinverse method given here) behind
lm have been carefully considered for numerical stability.