[R] Stepwise regression and PLS
Thomas Lumley
tlumley at u.washington.edu
Mon Feb 2 18:47:32 CET 2004
On Sun, 1 Feb 2004, Jinsong Zhao wrote:
>
> In the case of stepwise, SPSS gave a model with 4 independent
> variables, but with step(), R gave a model with 10 and a much higher
> R2. Furthermore, regsubsets() also indicates that the 10-variable model
> is one of the best regression subsets. How can this difference be
> explained? And in the case of my data set, how many variables entering
> the model would be reasonable?
>
Most likely because step() uses AIC and SPSS uses a p-value criterion, so
the models are `best' in different ways. regsubsets() gives best models
of each size, so it doesn't address the 4 vs 10 issue.
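To make the difference in criteria concrete, here is a minimal sketch on simulated data (the variables x1..x10 and y are made up for illustration and are not the poster's data). step() penalizes each parameter by k, with k = 2 giving AIC. For a single 1-df term that is roughly a test at p = pchisq(2, 1, lower.tail = FALSE), about 0.157, which is much more permissive than a 5% rule. Using k = qchisq(0.95, 1) is only a rough stand-in for a p-value criterion; it is not claimed to reproduce what SPSS does.

```r
# Simulated data: names and coefficients are illustrative assumptions.
set.seed(1)
n <- 100
X <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(X) <- paste0("x", 1:10)
X$y <- 2 * X$x1 + 1 * X$x2 + 0.3 * X$x3 + 0.2 * X$x4 + rnorm(n)

full <- lm(y ~ ., data = X)

# Default penalty k = 2 per parameter (AIC); backward elimination.
fit_aic <- step(full, direction = "backward", trace = 0)

# Stricter penalty, roughly a 5% test for each 1-df term.
fit_p05 <- step(full, direction = "backward", k = qchisq(0.95, 1), trace = 0)

n_aic <- length(coef(fit_aic)) - 1   # predictors kept under AIC
n_p05 <- length(coef(fit_p05)) - 1   # predictors kept under the stricter penalty
c(AIC = n_aic, p05 = n_p05)
```

With backward elimination the stricter penalty can only keep the same terms or fewer, so a gap like 4 versus 10 is not surprising in itself.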
Choosing between a 4-variable and a 10-variable model isn't what
regsubsets() is intended for. If you want a single model for prediction,
you need a method based on an honest estimate of prediction error, and if
you want a single model to explain relationships, you need to think about
relationships.
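One common way to get an honest estimate of prediction error is cross-validation with the selection step repeated inside every fold. The sketch below assumes that approach; the thread itself does not name a specific method, and the data and the helper cv_mse() are hypothetical.

```r
# Simulated data: names and coefficients are illustrative assumptions.
set.seed(2)
n <- 120
d <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(d) <- paste0("x", 1:10)
d$y <- 2 * d$x1 + d$x2 + rnorm(n)

K <- 5
fold <- sample(rep(1:K, length.out = n))   # random fold assignment

# Hypothetical helper: K-fold CV error of "fit full model, then step()".
cv_mse <- function(k_penalty) {
  sq_err <- numeric(n)
  for (i in 1:K) {
    train <- d[fold != i, ]
    held  <- d[fold == i, ]
    # Selection is redone on each training set, so the held-out error
    # reflects the selection step as well as the coefficient estimates.
    fit <- step(lm(y ~ ., data = train), k = k_penalty, trace = 0)
    sq_err[fold == i] <- (held$y - predict(fit, newdata = held))^2
  }
  mean(sq_err)
}

mse_aic <- cv_mse(2)                  # AIC penalty
mse_p05 <- cv_mse(qchisq(0.95, 1))    # stricter penalty
c(AIC = mse_aic, p05 = mse_p05)
```

Selecting variables on the full data and then cross-validating only the final model would understate the error, which is why step() sits inside the loop.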
While people seem to want to use it for finding a single model,
the purpose of regsubsets() is to give you many models, precisely as a
way around the problem of instability everyone else has pointed out.
Given a large number of models you can see what features
are common to them, or you can do a crude but reasonably effective
approximation to model averaging.
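As a rough sketch of that use, the code below asks regsubsets() (from the leaps package) for several models of each size and tabulates how often each variable appears. The data are simulated, and the BIC-based weights are my assumption about one way to do a crude model average; the post does not specify a weighting scheme.

```r
library(leaps)  # provides regsubsets()

# Simulated data: names and coefficients are illustrative assumptions.
set.seed(3)
n <- 100
d <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(d) <- paste0("x", 1:10)
d$y <- 2 * d$x1 + d$x2 + 0.3 * d$x3 + rnorm(n)

# Keep the 5 best models of each size rather than one overall winner.
rs <- regsubsets(y ~ ., data = d, nbest = 5, nvmax = 10)
s  <- summary(rs)

# s$which is a logical matrix: one row per model, one column per term.
incl <- s$which[, -1, drop = FALSE]   # drop the intercept column

# Fraction of the retained models that contain each variable.
freq <- sort(colMeans(incl), decreasing = TRUE)

# Crude weights from BIC differences (an assumed scheme), then
# weighted inclusion frequencies; row i of incl is scaled by w[i].
w <- exp(-0.5 * (s$bic - min(s$bic)))
w <- w / sum(w)
wfreq <- sort(colSums(incl * w), decreasing = TRUE)

freq
wfreq
```

Variables that show up in nearly all of the good models are the stable features; ones that come and go are the part of the answer that a single selected model would overstate.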
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle