[R] Seemingly simple "lm" giving unexpected results

William Dunlap wdunlap at tibco.com
Mon Apr 16 16:55:48 CEST 2012


Instead of changing the tolerance of the QR decomposition,
you could use poly(x1, df=1) to center and scale x1.  The coefficients
will be different but the predictions will be fine (because the
centering and scaling information is stored in the fitted model
as the "predvars" attribute of the terms component):
  > predict(lm(y~poly(x1,df=1)), newdata=list(x1=min(x1)+10))
         1   
   129.4292
  > attr(terms(lm(y~poly(x1,df=1))), "predvars")
  list(y, poly(x1, df = 1, coefs = list(alpha = 1334009455.477, 
      norm2 = c(1, 9, 9131.28718957214))))
vs. doing the offset by hand
  > x2 <- x1 - min(x1)
  > predict(lm(y~x2), newdata=list(x2=10))
         1 
  129.4292
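As a self-contained check (a sketch using the y and x1 from the question
quoted below; the names x0, fit.poly, and fit.shift are introduced here,
and degree=1 is the documented argument name for poly), both routes give
the same prediction:

```r
## y and x1 from the question quoted below
y  <- c(58, 57, 57, 279, 252, 851, 45, 87, 47)
x1 <- c(1334009411.437, 1334009411.437, 1334009411.437, 1334009469.297,
        1334009469.297, 1334009469.297, 1334009485.697, 1334009485.697,
        1334009485.697)

x0 <- min(x1)                             # fixed shift, stored outside the formula
fit.poly  <- lm(y ~ poly(x1, degree = 1)) # centered and scaled basis
fit.shift <- lm(y ~ I(x1 - x0))           # offset removed by hand

## both fits predict the same value 10 seconds past the minimum
p1 <- unname(predict(fit.poly,  newdata = list(x1 = x0 + 10)))
p2 <- unname(predict(fit.shift, newdata = list(x1 = x0 + 10)))
c(p1, p2)  # both approximately 129.4292
```

(Keeping x0 in a variable matters: writing I(x1 - min(x1)) directly in the
formula would recompute min() from the newdata at prediction time.)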

I've wondered if lm() should automatically precondition variables
of any of the time-related classes by removing the minimum,
and perhaps dividing by a scale factor.  Big offsets are a common
problem with time variables.  (Your x1 is close to the number
of seconds from 1970 until last week, which is what as.numeric()
of a POSIXct would give.)
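Done by hand, that preconditioning might look like this (a sketch; tm is a
hypothetical POSIXct predictor with made-up timestamps):

```r
y  <- c(58, 57, 57, 279, 252, 851, 45, 87, 47)
## hypothetical POSIXct predictor: as.numeric(tm) is on the order of 1.3e9
tm <- as.POSIXct("2012-04-09 12:10:11", tz = "UTC") + rep(c(0, 58, 74), each = 3)

## remove the big since-1970 offset before fitting
elapsed <- as.numeric(tm) - as.numeric(min(tm))
fit <- lm(y ~ elapsed)
coef(fit)  # slope is estimated from well-conditioned numbers
```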

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Gene Leynes
> Sent: Monday, April 16, 2012 12:42 AM
> To: peter dalgaard
> Cc: r-help
> Subject: Re: [R] Seemingly simple "lm" giving unexpected results
> 
> Peter,
> 
> This is exactly the answer I wanted.
> 
> 1) I was a little fuzzy on how the qr decomposition was creating the "error"
> 2) I wanted to know if it was possible to just change a setting to get
> around the "error".  Changing the tol in lm makes a lot more sense to me
> than changing the global eps.
> 
> Also, I love the "<lecture on numerical linear algebra>" tags.  I wish that
> conceptual structure were a common convention in human language.
> 
> Thanks,
> 
> Gene
> 
> On Sat, Apr 14, 2012 at 2:45 PM, peter dalgaard <pdalgd at gmail.com> wrote:
> 
> >
> > On Apr 14, 2012, at 14:40 , Berend Hasselman wrote:
> >
> > >
> > > On 13-04-2012, at 22:20, Gene Leynes wrote:
> > >
> > >> I can't figure out why this is returning an NA for the slope in one
> > case,
> > >> but not in the other.
> > >>
> > >> I can tell that R thinks the first case is singular, but why isn't the
> > >> second?
> > >>
> > >> ## Define X and Y
> > >> ## There are two versions of x
> > >> ##     1) "as is"
> > >> ##     2) shifted to start at 0
> > >> y  = c(58, 57, 57, 279, 252, 851, 45, 87, 47)
> > >> x1 = c(1334009411.437, 1334009411.437, 1334009411.437, 1334009469.297,
> > >>        1334009469.297, 1334009469.297, 1334009485.697, 1334009485.697,
> > >>        1334009485.697)
> > >> x2 = x1 - min(x1)
> > >>
> > >> ## Why doesn't the LM model work for the "as is" x?
> > >> lm(y~x1)
> > >> lm(y~x2)
> > >
> > >
> > > With your data  the matrix t(X)%*%X is extremely ill-conditioned for the
> > case with x1.
> > > See
> > >
> > > http://en.wikipedia.org/wiki/Condition_number
> > > http://mathworld.wolfram.com/ConditionNumber.html
> > > http://en.wikipedia.org/wiki/Numerical_methods_for_linear_least_squares
> > >
> > > You can check it with
> > >
> > > makeXX <- function(x) {
> > >    matrix(data=c(x,rep(1,length(x))),nrow=length(x), byrow=FALSE)
> > > }
> > >
> > > X1 <- makeXX(x1)
> > > (XTX1 <- t(X1) %*% X1)
> > > svd(XTX1)
> > >
> > > and similar for x2.
> >
> > <lecture on numerical linear algebra>
> >
> > lm() is actually a bit smarter than to use the textbook formula
> > (X'X)^{-1}X'Y: it is internally based on a QR decomposition, which is
> > numerically far more stable. What it does, effectively, is orthogonalize
> > the columns of X successively.
> >
> > > x <- cbind(1, x1)
> > > qr(x)
> > $qr
> >                            x1
> >  [1,] -3.0000000 -4.002028e+09
> >  [2,]  0.3333333  9.555777e+01
> >  [3,]  0.3333333  3.456548e-01
> >  [4,]  0.3333333 -2.598428e-01
> >  [5,]  0.3333333 -2.598428e-01
> >  [6,]  0.3333333 -2.598428e-01
> >  [7,]  0.3333333 -4.314668e-01
> >  [8,]  0.3333333 -4.314668e-01
> >  [9,]  0.3333333 -4.314668e-01
> >
> > $rank
> > [1] 1
> >
> > $qraux
> > [1] 1.333333 1.345655
> >
> > $pivot
> > [1] 1 2
> >
> > attr(,"class")
> > [1] "qr"
> >
> > However, notice the rank of 1. That's because it wants to protect the user
> > against unwittingly plugging in a singular design matrix, so when the
> > algorithm encounters a variable that looks like it is getting reduced to
> > zero after orthogonalization, it basically throws it out.
> >
> > You can defeat this mechanism by setting the tol= argument sufficiently
> > low. In fact, you can do the same with lm itself:
> >
> > > lm(y ~ x1, tol = 0)
> > ...
> > (Intercept)           x1
> >  -2.474e+09    1.854e+00
> >
> > Notice that the slope is consistent with the
> >
> > > lm(y ~ x2)
> > ...
> > (Intercept)           x2
> >    110.884        1.854
> >
> > In contrast if you try the textbook way:
> >
> > > solve(crossprod(x), crossprod(x,y))
> > Error in solve.default(crossprod(x), crossprod(x, y)) :
> >  system is computationally singular: reciprocal condition number =
> > 2.67813e-34
> >
> > Ah, well, let's try and defeat that:
> >
> > > solve(crossprod(x), crossprod(x,y), tol=0)
> >            [,1]
> >   -2.959385e+09
> > x1  2.218414e+00
> >
> > Which, as you'll notice, is, er, somewhat off the mark.
> >
> > </lecture on numerical linear algebra>
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
> >
> >
> >
> >
> >
> >
> >
> >
> >
> 


