[R] Model validation and penalization with rms package

Tue Jun 29 03:31:04 CEST 2010

I’ve been using Frank Harrell’s rms package to do bootstrap model
validation. Is it the case that the optimum penalization may still
give a model which is substantially overfitted?

I calculated corrected R^2, optimism in R^2, and corrected slope for
various penalties for a simple example:

x1 <- rnorm(45)
x2 <- rnorm(45)
x3 <- rnorm(45)
y <- x1 + 2*x2 + rnorm(45,0,3)

ols0 <- ols(y ~ x1 + x2 + x3, x=TRUE, y=TRUE)

corrected.Rsq <- rep(0,60)
optimism.Rsq <- rep(0,60)
corrected.slope <- rep(0,60)

for (pen in 1:60) {
olspen <- ols(y ~ x1 + x2 + x3, penalty=pen, x=TRUE, y=TRUE)
val <- validate(olspen, B=200)
corrected.Rsq[pen] <- val["R-square", "index.corrected"]
optimism.Rsq[pen] <- val["R-square", "optimism"]
corrected.slope[pen] <- val["Slope", "index.corrected"]
}
plot(corrected.Rsq)
x11(); plot(optimism.Rsq)
x11(); plot(corrected.slope)
p <- pentrace(ols0, penalty=1:60)
ols9 <- ols(y ~ x1 + x2 + x3, penalty=9, x=TRUE, y=TRUE)
validate(ols9, B=200)
 		index.orig  training       test           optimism    index.corrected   n
R-square	0.2523404 0.2820734  0.1911878  0.09088563       0.1614548 200
MSE 	7.8497722 7.3525300  8.4918212 -1.13929116       8.9890634 200
Intercept   0.0000000 0.0000000 -0.1353572  0.13535717      -0.1353572 200
Slope       1.0000000 1.0000000  1.1707137 -0.17071372       1.1707137 200

pentrace tells me that of the penalties 1, 2,..., 60, corrected AIC is
maximised by a penalty of 9. This is consistent with the corrected R^2
plot, which shows a maximum somewhere around 10. However, a penalty of
9 still gives an R^2 optimism of 0.09 (training R^2=0.28, test
R^2=0.19), suggesting overfitting.

Do we just have to live with this R^2 optimism? It can be decreased by
taking a bigger penalty, but then the corrected R^2 is reduced.  Also,
a penalty of 9 gives a corrected slope of about 1.17 (corrected slope
of 1 is achieved with a penalty of about 1 or 2).

Thanks for any help/advice you can give.

Mark
--
Mark Seeto
Statistician

National Acoustic Laboratories
A Division of Australian Hearing