[R] VIF's in R using BIGLM

Mon Apr 27 12:22:05 CEST 2009

Dear R-help

This is a follow-up to my previous post here:
http://groups.google.com/group/r-help-archive/browse_thread/thread/d9b6f87ce06a9fb7/e9be30a4688f239c?lnk=gst&q=dobomode#e9be30a4688f239c

I am working on developing an open-source automated system for running
batch-regressions on very large datasets. In my previous post, I posed
the question of obtaining VIF's from the output of BIGLM. With a lot
of help from Assoc. Professor, Biostatistics Thomas Lumley at
University of Washington, I was able to make significant progress, but
ultimately got stuck. The following post describes the steps and
reasoning I undertook in trying to accomplish this task. Please note
that I am not a statistician so ignore any commentary that seems naive
to you.

A quick intro. The goal is to obtain VIF's (variance inflation
factors) from the regression output of BIGLM. Traditionally, this has
been possible with the regular lm() function. Follows a quick
illustration (the model below is pretty silly, only for illustration
purposes).

Example dataset:

> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear
carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4
4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4
4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4
1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3
1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3
2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3
1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3
4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4
2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4
2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4
4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4
4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3
3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3
3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3
3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3
4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3
4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3
4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4
1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4
2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4
1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3
1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3
2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3
2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3
4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3
2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4
1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5
2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5
2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5
4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5
6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5
8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4
2

Example model:

model <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear +
carb

Regression:

reg_lm <- lm(model, mtcars)

Results:

> summary(reg_lm)

Call:
lm(formula = model, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869,	Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

VIF's:

> vif(reg_lm)
      cyl      disp        hp      drat        wt      qsec
vs        am      gear      carb
15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873
4.648487  5.357452  7.908747

Here is the definition of vif() (courtesy of
http://www.stat.sc.edu/~hitchcock/bodyfatRexample.txt):

> vif.lm
function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
        nam <- names(coef(object))
  if(k <- match("(Intercept)", nam, nomatch = F)) {
                v1 <- diag(V)[-k]
                v2 <- (diag(Vi)[-k] - Vi[k, -k]^2/Vi[k,k])
                nam <- nam[-k]
        } else {
                v1 <- diag(V)
                v2 <- diag(Vi)
                warning("No intercept term detected. Results may
surprise.")
        }
        structure(v1*v2, names = nam)
}

My understanding is that the function uses the unscaled variance-
covariance matrix V and the model matrix Vi to compute the VIF's based
on the equations above.

Now, attempting to do the same using biglm().

> reg_biglm <- biglm(model, mtcars)
> summary(reg_biglm)
Large data regression model: biglm(model, mtcars)
Sample size =  32
              Coef    (95%    CI)     SE     p
(Intercept) 12.303 -25.132 49.739 18.718 0.511
cyl         -0.111  -2.201  1.979  1.045 0.915
disp         0.013  -0.022  0.049  0.018 0.455
hp          -0.021  -0.065  0.022  0.022 0.324
drat         0.787  -2.484  4.058  1.635 0.630
wt          -3.715  -7.504  0.074  1.894 0.050
qsec         0.821  -0.641  2.283  0.731 0.261
vs           0.318  -3.891  4.527  2.105 0.880
am           2.520  -1.593  6.634  2.057 0.220
gear         0.655  -2.331  3.642  1.493 0.661
carb        -0.199  -1.857  1.458  0.829 0.810

The challenge is to adapt the vif() function to work with biglm. I was
able to obtain the model matrix Vi using the following method.

biglm supports the calculation of the Huber/White sandwich covariance
matrix:

> reg_biglm <- biglm(model, mtcars,, TRUE)
> sandwich <- vcov(reg_biglm)

Out of this, I can extract the model-based scaled variance-covariance
matrix:

> v.scaled <- attr(vcov(reg_biglm), "model-based")

Based on my research (big thanks to Assoc. Professor, Biostatistics
Thomas Lumley at University of Washington):

sandwich = v.scaled × model_matrix × v.scaled

It is the model_matrix I need to obtain to compute the VIF's. I do so
by inverting v.scaled and moving them to the other side of the
equation.

> v.scaled.inverted <- solve(v.scaled)
> model_matrix <- v.scaled.inverted %*% sandwich %*% v.scaled.inverted

At this point, I have both the variance-covariance matrix and the
model matrix, so I should be able to compute the VIF's using the
function above. However, on examining the contents of the variance-
covariance matrix, I see that the one from the lm() function is
scaled, while the one I obtained from the sandwich calculations is
not:

> head(v.scaled)
              (Intercept)           cyl          disp
hp          drat           wt          qsec           vs
am          gear
(Intercept) 350.359197487 -13.163824437 -0.0059101820 -0.0266100189
-12.941827906  3.156000750 -10.546511742  3.646214479 -8.8845922019
-11.357908146
cyl         -13.163824437   1.092073828 -0.0049669381 -0.0041817837
0.471219880  0.223389955   0.209717934  0.705445371  0.5605478034
0.550163881
disp         -0.005910182  -0.004966938  0.0003188903 -0.0002040364
-0.003384402 -0.026026366   0.003725863  0.003764231  0.0009902057
-0.002094405
hp           -0.026610019  -0.004181784 -0.0002040364  0.0004738710
0.003095272  0.009914533   0.001724661 -0.012474163 -0.0021256566
-0.002907768
drat        -12.941827906   0.471219880 -0.0033844018  0.0030952720
2.674445074  0.521759369   0.043438912 -0.102380147 -0.5260445782
-0.182954006
wt            3.156000750   0.223389955 -0.0260263658  0.0099145334
0.521759369  3.588805538  -0.702017025  0.338026038  0.3675570022
0.513001888

> head(summary(reg_lm)$cov.unscaled)
              (Intercept)           cyl          disp
hp          drat           wt          qsec            vs
am          gear
(Intercept) 49.8835321876 -1.8742423910 -8.414814e-04 -3.788688e-03
-1.8426349117  0.449345889 -1.5015939690  0.5191416655 -1.2649727601
-1.6171191755
cyl         -1.8742423910  0.1554875692 -7.071840e-04 -5.953951e-04
0.0670914657  0.031805873  0.0298592741  0.1004400830  0.0798098197
0.0783313749
disp        -0.0008414814 -0.0007071840  4.540305e-05 -2.905034e-05
-0.0004818652 -0.003705589  0.0005304819  0.0005359447  0.0001409838
-0.0002981977
hp          -0.0037886881 -0.0005953951 -2.905034e-05  6.746893e-05
0.0004406994  0.001411614  0.0002455543 -0.0017760496 -0.0003026473
-0.0004140029
drat        -1.8426349117  0.0670914657 -4.818652e-04  4.406994e-04
0.3807828305  0.074287190  0.0061847567 -0.0145767070 -0.0748973106
-0.0260486727
wt           0.4493458888  0.0318058726 -3.705589e-03  1.411614e-03
0.0742871900  0.510967881 -0.0999519610  0.0481275585  0.0523321257
0.0730403152

They differ by a constant factor:

> head(v.scaled) / head(summary(reg_lm)$cov.unscaled)
            (Intercept)      cyl     disp       hp     drat
wt     qsec       vs       am     gear     carb
(Intercept)    7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
cyl            7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
disp           7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
hp             7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
drat           7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
wt             7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544

This factor turns out to be the squared value of the regression's
sigma:

> summary(reg_lm)$sigma ^ 2
[1] 7.023544

Sigma is defined as:
the square root of the estimated variance of the random error

sigma^2 = 1/(n-p) Sum(w[i] R[i]^2),
where R[i] is the i-th residual, residuals[i].

The ultimate challenge is that sigma is returned by lm(), but not by
biglm(). To calcualte sigma, it looks like I  would need to compute
the residuals for the entire dataset which defeats the purpose of biglm
(). Does anybody know if there is an easier way of obtaining sigma?
Any guidance on this subject would be greatly appreciated!