[R] Condition indexes and variance inflation factors
John Fox
jfox at mcmaster.ca
Sun Jul 27 14:48:46 CEST 2003
Dear Peter,
I'm sorry that I've taken a while to get back to you -- I was away for a
few days.
In the example that you give from Belsley (1991), the predictors are
essentially perfectly linearly related; for example:
> summary(lm(x2a ~ x3a + x4a))
Call:
lm(formula = x2a ~ x3a + x4a)
Residuals:
         1          2          3          4          5          6          7          8
-0.0195624 -0.0152938  0.0078068  0.0323025 -0.0087845  0.0025448  0.0014472 -0.0004606
Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept)  0.007943   0.010025    0.792    0.464
x3a         -6.181811   0.016069 -384.716 2.25e-12 ***
x4a         28.540996   0.066907  426.580 1.34e-12 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.01901 on 5 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
F-statistic: 1.033e+05 on 2 and 5 DF, p-value: 2.879e-12
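For a direct check along these lines (a sketch, using the variables from
your message below), base R's kappa() reports the condition number of the
model matrix:

> kappa(lm(y ~ x2a + x3a + x4a), exact = TRUE)  # huge, since the x's are nearly dependent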
In a case like this, the variance-inflation factors will also be very large:
> vif(lm(y~ x2a + x3a + x4a))
     x2a      x3a      x4a
41333.34 47141.19 57958.62
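Each of those values is just 1/(1 - R^2) from the regression of one
predictor on the rest, so the near-perfect fit above translates directly
into huge VIFs. For instance (with the same data):

> r2 <- summary(lm(x2a ~ x3a + x4a))$r.squared
> 1/(1 - r2)  # reproduces the vif() value for x2a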
Any of several methods of discovering the linear relationship among the x's
will work -- including the first regression above, a principal-components
analysis, and Belsley's approach.
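(To sketch the principal-components route with the same data: the
smallest component has nearly zero variance, and its loadings recover
the linear dependency among the x's.)

> pc <- princomp(cbind(x2a, x3a, x4a))
> pc$sdev           # the last standard deviation is nearly zero
> pc$loadings[, 3]  # loadings of the (near-)degenerate component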
I'm not arguing that discovering the source of large standard errors in a
regression is uninteresting, although in most circumstances there isn't
much that one can do about it short of collecting new data. In any case,
this probably isn't the proper forum for a detailed discussion of
collinearity (my fault for broaching the issue in the first place).
Except with respect to centering the data, I suspect that we largely agree
about these matters.
Regards,
John
At 11:03 AM 7/24/2003 -0400, Peter Flom wrote:
>Dear John
>
>An interesting discussion!
>
>I would be the last to suggest ignoring such diagnostics as Cook's D;
>as you point out, it diagnoses a problem which condition indices do
>not: whether a point is influential.
>
>OTOH, condition indices diagnose a problem which Cook's D does not:
>whether shifting the data slightly would change the results.
>
>Consider the data given in Belsley (1991) on p. 5
>
>y <- c(3.3979, 1.6094, 3.7131, 1.6767, 0.0419, 3.3768, 1.1661, 0.4701)
>x2a <- c(-3.138, -0.297, -4.582, 0.301, 2.729, -4.836, 0.065, 4.102)
>x2b <- c(-3.136, -0.296, -4.581, 0.300, 2.730, -4.834, 0.064, 4.103)
>x3a <- c(1.286, 0.250, 1.247, 0.498, -0.280, 0.350, 0.208, 1.069)
>x3b <- c(1.288, 0.251, 1.246, 0.498, -0.281, 0.349, 0.206, 1.069)
>x4a <- c(0.169, 0.044, 0.109, 0.117, 0.035, -0.094, 0.047, 0.375)
>x4b <- c(0.170, 0.043, 0.108, 0.118, 0.036, -0.093, 0.048, 0.376)
>
>where x2a, x3a and x4a are very similar to x2b, x3b and x4b,
>respectively, and where both sets are generated from
>
>y = 1.2 - 0.4 x2 + 0.6 x3 + 0.9 x4 + e,  e ~ N(0, 0.01)
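>
>(For instance -- a sketch, reading 0.01 as the error variance, with
>ysim as an illustrative name -- data of this kind can be generated
>along these lines:)
>
>e <- rnorm(8, mean = 0, sd = 0.1)               # sd 0.1, i.e. variance 0.01
>ysim <- 1.2 - 0.4*x2a + 0.6*x3a + 0.9*x4a + e   # simulated response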
>
>Then
>modela <- lm(y~ x2a + x3a + x4a)
>and
>modelb <- lm(y~x2b + x3b + x4b)
>
>give radically different results, with neither having any significant
>parameters other than the intercept. Admittedly, the standard errors
>for a couple of the parameters are large. But why are they large? I
>have certainly dealt with models with large standard errors that have
>nothing to do with collinearity.
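>
>(Putting the two fits side by side makes the instability plain -- a
>sketch:)
>
>cbind(a = coef(modela), b = coef(modelb))  # estimates diverge sharply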
>
>Here, the function PI.lm (supplied by Juergen Gross) gives huge
>condition indices and indicates that the nature of the problem is that
>all three of the x variables are highly collinear.
>
>Variance-decomposition proportions for scaled condition indexes:
>
>Cond. index  (Intercept)  x2b  x3b  x4b
>          1       0.0494    0    0    0
>          1       0.0009    0    0    0
>          3       0.8101    0    0    0
>        464       0.1396    1    1    1
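>
>(A base-R sketch of the same computation, in case PI.lm isn't to hand:
>scale the model-matrix columns to unit length, take the svd, and form
>the variance-decomposition proportions from it.)
>
>X <- cbind(1, x2b, x3b, x4b)                       # model matrix
>Xs <- apply(X, 2, function(v) v / sqrt(sum(v^2)))  # unit-length columns
>s <- svd(Xs)
>round(max(s$d) / s$d)               # scaled condition indexes
>phi <- sweep(s$v^2, 2, s$d^2, "/")  # v_jk^2 / d_k^2
>round(t(phi / rowSums(phi)), 4)     # rows = indexes, cols = coefficients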
>
-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox