[R] Condition indexes and variance inflation factors
John Fox
jfox at mcmaster.ca
Sun Jul 27 14:48:46 CEST 2003
Dear Peter,
I'm sorry that I've taken a while to get back to you -- I was away for a
few days.
In the example that you give from Belsley (1991), the predictors are
essentially perfectly linearly related; for example:
> summary(lm(x2a ~ x3a + x4a))
Call:
lm(formula = x2a ~ x3a + x4a)
Residuals:
         1          2          3          4          5          6          7          8
-0.0195624 -0.0152938  0.0078068  0.0323025 -0.0087845  0.0025448  0.0014472 -0.0004606
Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept)  0.007943   0.010025    0.792    0.464
x3a         -6.181811   0.016069 -384.716 2.25e-12 ***
x4a         28.540996   0.066907  426.580 1.34e-12 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.01901 on 5 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
F-statistic: 1.033e+05 on 2 and 5 DF, p-value: 2.879e-12
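For a direct check along these lines (a sketch, using the variables from
your message below), base R's kappa() reports the condition number of the
model matrix:

> kappa(lm(y ~ x2a + x3a + x4a), exact = TRUE)  # huge, since the x's are nearly dependent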
In a case like this, the variance-inflation factors will also be very large:
> vif(lm(y~ x2a + x3a + x4a))
     x2a      x3a      x4a
41333.34 47141.19 57958.62
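Each of those values is just 1/(1 - R^2) from the regression of one
predictor on the rest, so the near-perfect fit above translates directly
into huge VIFs. For instance (with the same data):

> r2 <- summary(lm(x2a ~ x3a + x4a))$r.squared
> 1/(1 - r2)  # reproduces the vif() value for x2a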
Any of several methods of discovering the linear relationship among the x's
will work -- including the first regression above, a principal-components
analysis, and Belsley's approach.
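(To sketch the principal-components route with the same data: the
smallest component has nearly zero variance, and its loadings recover
the linear dependency among the x's.)

> pc <- princomp(cbind(x2a, x3a, x4a))
> pc$sdev           # the last standard deviation is nearly zero
> pc$loadings[, 3]  # loadings of the (near-)degenerate component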
I'm not arguing that discovering the source of large standard errors in a
regression is uninteresting, although in most circumstances there isn't
much that one can do about it short of collecting new data. In any case,
this probably isn't the proper forum for a detailed discussion of
collinearity (my fault for broaching the issue in the first place).
Except with respect to centering the data, I suspect that we largely agree
about these matters.
Regards,
John
At 11:03 AM 7/24/2003 -0400, Peter Flom wrote:
>Dear John
>
>An interesting discussion!
>
>I would be the last to suggest ignoring such diagnostics as Cook's D;
>as you point out, it diagnoses a problem which condition indices do
>not: whether a point is influential.
>
>OTOH, condition indices diagnose a problem which Cook's D does not:
>whether shifting the data slightly would change the results.
>
>Consider the data given in Belsley (1991) on p. 5
>
>y <- c(3.3979, 1.6094, 3.7131, 1.6767, 0.0419, 3.3768, 1.1661, 0.4701)
>x2a <- c(-3.138, -0.297, -4.582, 0.301, 2.729, -4.836, 0.065, 4.102)
>x2b <- c(-3.136, -0.296, -4.581, 0.300, 2.730, -4.834, 0.064, 4.103)
>x3a <- c(1.286, 0.250, 1.247, 0.498, -0.280, 0.350, 0.208, 1.069)
>x3b <- c(1.288, 0.251, 1.246, 0.498, -0.281, 0.349, 0.206, 1.069)
>x4a <- c(0.169, 0.044, 0.109, 0.117, 0.035, -0.094, 0.047, 0.375)
>x4b <- c(0.170, 0.043, 0.108, 0.118, 0.036, -0.093, 0.048, 0.376)
>
>where x2a, x3a and x4a are very similar to x2b, x3b and x4b,
>respectively, and where both sets are generated from
>
>y = 1.2 - 0.4 x2 + 0.6 x3 + 0.9 x4 + e,  e ~ N(0, 0.01)
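>
>(For instance -- a sketch, reading 0.01 as the error variance, with
>ysim as an illustrative name -- data of this kind can be generated
>along these lines:)
>
>e <- rnorm(8, mean = 0, sd = 0.1)               # sd 0.1, i.e. variance 0.01
>ysim <- 1.2 - 0.4*x2a + 0.6*x3a + 0.9*x4a + e   # simulated response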
>
>Then
>modela <- lm(y~ x2a + x3a + x4a)
>and
>modelb <- lm(y~x2b + x3b + x4b)
>
>give radically different results, with neither having any significant
>parameters other than the intercept. Admittedly, the standard errors
>for a couple of the parameters are large. But why are they large? I
>have certainly dealt with models with large standard errors that have
>nothing to do with collinearity.
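>
>(Putting the two fits side by side makes the instability plain -- a
>sketch:)
>
>cbind(a = coef(modela), b = coef(modelb))  # estimates diverge sharply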
>
>Here, the function PI.lm (supplied by Juergen Gross) gives huge
>condition indices and indicates that the nature of the problem is that
>all three of the x variables are highly collinear.
>
>Variance-decomposition proportions for scaled condition indexes:
>
>Cond. index  (Intercept)  x2b  x3b  x4b
>          1       0.0494    0    0    0
>          1       0.0009    0    0    0
>          3       0.8101    0    0    0
>        464       0.1396    1    1    1
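>
>(A base-R sketch of the same computation, in case PI.lm isn't to hand:
>scale the model-matrix columns to unit length, take the svd, and form
>the variance-decomposition proportions from it.)
>
>X <- cbind(1, x2b, x3b, x4b)                       # model matrix
>Xs <- apply(X, 2, function(v) v / sqrt(sum(v^2)))  # unit-length columns
>s <- svd(Xs)
>round(max(s$d) / s$d)               # scaled condition indexes
>phi <- sweep(s$v^2, 2, s$d^2, "/")  # v_jk^2 / d_k^2
>round(t(phi / rowSums(phi)), 4)     # rows = indexes, cols = coefficients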
>
-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox