[R] Change in 'solve' for r-patched

Thomas Lumley tlumley at u.washington.edu
Sat Nov 1 01:47:28 CET 2003

```On Fri, 31 Oct 2003, Williams, Elliot - BLS wrote:
>
> None of these address the problem of numerical accuracy, which was the
> thrust of Doug's comments.  Has anyone done a quick simulation to
> investigate the stability of the solutions?  Is it small coefficients or
> near-collinearity (or both) that one has to worry about?
>
> Are the solve(crossprod()) methods obviously unstable?  They surely do work
> quickly!
>

They are usually not *very* unstable.  That is, solve(crossprod())
potentially loses about twice as many digits to rounding error as Doug's
proposals. In many cases this is not that serious -- if you have 15 digits
accuracy then losing 2 instead of 1 (or even 8 instead of 4) of them may
not be a big deal.  Back in the days of single precision this was more
obviously important and you would get visible differences between
statistical packages based on their linear algebra routines. (There was a
notorious econometric data set that caused problems in single precision
with the solve(crossprod()) approach but was fine with QR decomposition).

On the other hand, even with double precision in extreme situations the
harm can be noticeable, and the cost of working with QR or Cholesky
decompositions instead is fairly small (and often negative -- with larger
matrices you would probably see the cholesky-based method being faster)

In linear regression the problem happens with collinearity, or when one
variable has much smaller variance than another.  In both of these cases
the eigenvalues of the information matrix are of very different sizes. The
ratio of the smallest to the largest eigenvalue is called the condition
number, and the tolerance in solve() is based on condition numbers.

So how bad is the error? For any matrix M there exists some vector b and
error e in b so that the relative error in solve(M,b+e) compared to
solve(M,b) is the condition number of the matrix M times the relative
error in b.  This means that condition numbers greater than 10^7 (and
finding these in package examples is what prompted Doug's message) can
amplify numerical error ten million times.  This is starting to get
undesirable even when you have 15 digits to work with.

Working with very ill-conditioned matrices is possible, but it requires
care in tracking and bounding errors, and if I wanted to do that I'd be in
applied math, not biostatistics.  That's why a simple fix that reduces
condition numbers from, say, 10^10 to 10^5 really is worthwhile,
especially for software that will be used by other people who don't
understand its internals.

-thomas

```