[R] Condition indexes and variance inflation factors

Thu Jul 24 15:57:24 CEST 2003

Dear Peter,

At 08:24 AM 7/24/2003 -0400, Peter Flom wrote:

>(1) I've never liked this approach for a model with a constant, where it
>makes more sense to me to centre the data. I realize that opinions differ
>here, but it seems to me that failing to centre the data conflates
>collinearity with numerical instability.
> >>>
>
>Opinions do differ.  A few years ago, I could have given more details
>(my dissertation was on this topic, but a lot of the details have
>disappeared from memory); I think, though, that Belsley is looking for a
>measure that deals not only with collinearity, but with several other
>problems, including numerical instability (the subtitle of his later
>book is Collinearity and Weak Data in Regression).  I remember being
>convinced that centering was generally not a good idea, but there are
>lots of people who disagree and who know a lot more statistics than I
>do.

To elaborate my remark slightly, in most problems the intercept is not of 
much interest. When the data are far from the origin, it's natural that the 
intercept isn't well estimated. When data are very far from the origin, 
computations with the uncentred data may be numerically unstable (depending 
upon how the computations are done) because of "collinearity with the 
intercept." If the real interest is in the coefficients other than the 
intercept, this seems to me purely a numerical artefact. The possibly more 
generally interesting sense of "collinearity" is imprecision in estimation 
due to strong relationships among the predictors.
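
A minimal sketch of the point about "collinearity with the intercept." The data are simulated and purely illustrative (not from this thread), and cond_number() is a hypothetical helper: it scales the columns of the model matrix to unit length and takes the ratio of the largest to the smallest singular value. Under these assumptions, centring the predictor should remove the ill-conditioning while leaving the slope's standard error unchanged:

```r
# Illustrative simulated data (an assumption, not from the thread):
# a single predictor located far from the origin.
set.seed(123)
x <- 1000 + rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)

# Hypothetical helper: condition number of a model matrix after
# scaling each column to unit length (no centring inside the helper).
cond_number <- function(X) {
  X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
  d <- svd(X)$d
  max(d) / min(d)
}

X_raw <- cbind(1, x)            # constant plus uncentred predictor
X_ctr <- cbind(1, x - mean(x))  # constant plus centred predictor

kappa_raw <- cond_number(X_raw) # large: x is nearly proportional to the constant
kappa_ctr <- cond_number(X_ctr) # about 1: centred x is orthogonal to the constant

# The slope and its standard error are the same in both parametrizations;
# only the intercept (and its standard error) changes.
fit_raw <- lm(y ~ x)
fit_ctr <- lm(y ~ I(x - mean(x)))
se_raw <- coef(summary(fit_raw))[2, "Std. Error"]
se_ctr <- coef(summary(fit_ctr))[2, "Std. Error"]
```

Comparing kappa_raw with kappa_ctr, and se_raw with se_ctr, separates the two senses of the word: the conditioning depends on where the origin is, while the precision of the slope does not.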

. . .

><<<
>(4) I have doubts about the whole enterprise. Collinearity is one source of
>imprecision -- others are small sample size, homogeneous predictors, and
>large error variance. Aren't the coefficient standard errors the bottom
>line? If these are sufficiently small, why worry?
> >>>
>
>I think (correct me if I am wrong) that the s.e.s and the condition
>indices serve very different purposes.  The condition indices are
>supposed to determine if small changes in the input data could make big
>differences in the results.  Belsley provides some examples where a tiny
>change in the data results in completely different results (e.g.,
>different standard errors, different coefficients (even reversing sign)
>and so on).

Indeed, ill-conditioned data produce unstable numerical solutions (even 
affected by how the data are rounded), but condition indices aren't a 
particularly effective way of looking for instability in a more general 
sense. Consider, for example, Anscombe's famous simple-regression examples, 
which are in the data frame Quartet in the car package. The fourth example 
has a highly influential data point (number 8):

 > Quartet[, c("x4", "y4")]
    x4    y4
1   8  6.58
2   8  5.76
3   8  7.71
4   8  8.84
5   8  8.47
6   8  7.04
7   8  5.25
8  19 12.50
9   8  5.56
10  8  7.91
11  8  6.89

The regression of y4 on x4 isn't especially ill-conditioned (using the 
function I posted yesterday):

 > mod <- lm(y4 ~ x4, data=Quartet)
 > belsley(mod)

Singular values:  1.394079 0.2377891
Condition indices:  1 5.86267

Variance-decomposition proportions
   (Intercept)    x4
1       0.028 0.028
2       0.972 0.972

but the 8th observation has an infinite Cook's D:

 > round(cooks.distance(mod), 2)
    1    2    3    4    5    6    7    8    9   10   11
0.01 0.06 0.02 0.14 0.09 0.00 0.12  Inf 0.08 0.03 0.00
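
The belsley() function is not reproduced in this message, so the following is only a sketch of one way to obtain the same kind of diagnostics. It assumes that the model matrix, including the constant, is scaled to unit column length without centring, and it uses the anscombe data frame from base R's datasets package in place of Quartet. Under that assumption the singular values and condition indices appear to agree with the output printed above. The second half computes the leverage of observation 8 by hand, which shows where the degenerate Cook's D comes from: the hat value is 1, so the fitted line passes through that point and the (1 - h) term in the denominator vanishes.

```r
# Anscombe's fourth data set, as shipped in base R's datasets package
X <- model.matrix(y4 ~ x4, data = anscombe)

# Assumption: scale each column, including the constant, to unit length
# (no centring), then take the singular value decomposition.
X_scaled <- sweep(X, 2, sqrt(colSums(X^2)), "/")
sv <- svd(X_scaled)
cond_indices <- max(sv$d) / sv$d

# Variance-decomposition proportions:
# phi[j, k] = v[j, k]^2 / d[k]^2, normalized within each coefficient j.
phi <- sweep(sv$v^2, 2, sv$d^2, "/")
vdp <- t(phi / rowSums(phi))   # rows: singular values; columns: coefficients
colnames(vdp) <- colnames(X)

# Leverage in simple regression: h = 1/n + (x - mean(x))^2 / Sxx
fit <- lm(y4 ~ x4, data = anscombe)
x <- anscombe$x4
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
h <- hatvalues(fit)            # the same quantity from the fitted model

# Observation 8 has leverage 1, so its residual is zero
e8 <- unname(residuals(fit)[8])
```

Here mean(x4) is 9 and the sum of squares about the mean is 110, so the leverage of observation 8 is 1/11 + 100/110 = 1, while each of the other ten observations has leverage 0.1.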

Regards,
  John
-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox