[R] Overfitting/Calibration plots (Statistics question)
Mark Seeto
markseeto at gmail.com
Fri Apr 9 00:52:04 CEST 2010
This isn't a question about R, but I'm hoping someone will be willing
to help. I've been looking at calibration plots in multiple regression
(plotting observed response Y on the vertical axis versus predicted
response [Y hat] on the horizontal axis).
According to Frank Harrell's "Regression Modeling Strategies" book
(pp. 61-63), when making such a plot on new data (having obtained a
model from other data) we should expect the points to be around a line
with slope < 1, indicating overfitting. As he writes, "Typically, low
predictions will be too low and high predictions too high."
However, when I make these plots, both with real data and with simple
simulated data, I get the opposite: the points are scattered around a
line with slope >1. Low predictions are too high and high predictions
are too low.
For example, I generated 200 cases, fitted models on the first half of
the data, and made calibration plots for those models on the second
half of the data:
> x1 <- rnorm(200, 0, 1)
> x2 <- rnorm(200, 0, 1)
> x3 <- rnorm(200, 0, 1)
> x4 <- rnorm(200, 0, 1)
> x5 <- rnorm(200, 0, 1)
> x6 <- rnorm(200, 0, 1)
> y <- x1 + x2 + rnorm(200, 0, 2)
> d <- data.frame(y, x1, x2, x3, x4, x5, x6)
>
> lm1 <- lm(y ~ ., data = d[1:100,])
> lm2 <- lm(y ~ x1 + x2, data = d[1:100,])
>
> plot(predict(lm1, d[101:200, ]), d$y[101:200]); abline(0,1)
> x11(); plot(predict(lm2, d[101:200, ]), d$y[101:200]); abline(0,1)
The plots for both lm1 and lm2 show the points scattered around a line
with slope > 1, contrary to what Frank Harrell says should happen.
I am either misunderstanding something or making a mistake in the code
(I'm almost 100% certain I'm not mixing up the axes). I would be most
appreciative if anyone could explain where I'm going wrong.
Thanks for any help you can provide.
Mark Seeto
More information about the R-help
mailing list