[R] Stepwise regression and PLS

Thomas Lumley tlumley at u.washington.edu
Mon Feb 2 18:47:32 CET 2004


On Sun, 1 Feb 2004, Jinsong Zhao wrote:

>
> In the case of stepwise regression, SPSS gave a model with 4 independent
> variables, but with step(), R gave a model with 10 and a much higher
> R2. Furthermore, regsubsets() also indicates that the 10-variable model is
> one of the best regression subsets. How can this difference be explained?
> And in the case of my data set, how many variables entering the model
> would be reasonable?
>

Most likely because step() selects by AIC while SPSS uses a p-value
criterion, so the models are 'best' in different ways. regsubsets() gives
the best models of each size, so it doesn't address the 4 vs 10 issue.
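
To see how much the criterion alone can matter, here is a minimal sketch on
simulated data (not the poster's data set; the variable names and
coefficients are made up). It assumes a 0.05 threshold purely for
illustration and mimics a p-value rule by raising the penalty k in step()
to the chi-squared(1) critical value, which is only a rough stand-in for
whatever SPSS actually does.

```r
# Simulated data: 10 candidate predictors, some with weak effects.
# (Hypothetical example; names and coefficients are assumptions.)
set.seed(1)
n <- 100
X <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(X) <- paste0("x", 1:10)
beta <- c(1, 0.8, 0.6, 0.4, 0.25, 0.25, 0.2, 0.2, 0, 0)
X$y <- as.vector(as.matrix(X[, 1:10]) %*% beta + rnorm(n))

full <- lm(y ~ ., data = X)

# Default step(): penalty k = 2 per parameter, i.e. AIC.
fit_aic <- step(full, direction = "backward", trace = 0)

# Penalty equal to the 5% chi-squared(1) critical value (about 3.84).
# Dropping a single term then behaves roughly like a test at p = 0.05.
k05 <- qchisq(0.05, df = 1, lower.tail = FALSE)
fit_p05 <- step(full, direction = "backward", k = k05, trace = 0)

# Significance level implied by the AIC penalty for a one-parameter term.
p_aic <- pchisq(2, df = 1, lower.tail = FALSE)   # about 0.157

# Number of predictors kept under each criterion.
n_aic <- length(coef(fit_aic)) - 1
n_p05 <- length(coef(fit_p05)) - 1
```

With backward elimination the stricter penalty can only remove more terms,
so n_p05 is never larger than n_aic. AIC behaves roughly like a p-value
cutoff near 0.157 for single terms, which is one reason it tends to keep
more variables than a 0.05 rule.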

Picking a single model isn't what regsubsets() is intended for.  If you want
a single model for prediction, you need a method based on an honest estimate
of prediction error, and if you want a single model to explain relationships,
you need to think about relationships.
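
One common way to get a more honest estimate of prediction error is
cross-validation with the variable selection repeated inside each fold.
The sketch below is one possible approach under assumptions, not a
recommendation of a specific procedure: the data are simulated, and the
choice of 5 folds and backward step() is arbitrary.

```r
# Simulated data (hypothetical): only x1 and x2 have real effects.
set.seed(2)
n <- 120
dat <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(dat) <- paste0("x", 1:10)
dat$y <- dat$x1 + 0.5 * dat$x2 + rnorm(n)

K <- 5
fold <- sample(rep(1:K, length.out = n))   # random fold labels
sq_err <- numeric(n)

for (i in 1:K) {
  train    <- dat[fold != i, ]
  held_out <- dat[fold == i, ]
  # Selection is repeated inside each fold, so the held-out rows
  # play no part in choosing the variables.
  sel <- step(lm(y ~ ., data = train), direction = "backward", trace = 0)
  sq_err[fold == i] <- (held_out$y - predict(sel, newdata = held_out))^2
}

# Cross-validated mean squared prediction error of the whole
# "select, then fit" procedure.
cv_mse <- mean(sq_err)
```

The point of refitting the selection in every fold is that the error
estimate then covers the selection step too, rather than only the final
fitted model.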

While people seem to want to use it for finding a single model,
the purpose of regsubsets() is to give you many models, precisely as a
way around the problem of instability everyone else has pointed out.
Given a large number of models you can see what features
are common to them, or you can do a crude but reasonably effective
approximation to model averaging.
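
A minimal sketch of that kind of use, on simulated data and assuming the
leaps package is installed. The weighting by exp(-BIC/2) is my assumption
about one reasonable way to do the crude averaging; the choice of nbest = 5
and of the top 10 models is arbitrary.

```r
library(leaps)   # provides regsubsets(); must be installed

# Simulated data (hypothetical): x1, x2, x3 have real effects.
set.seed(3)
n <- 100
dat <- as.data.frame(matrix(rnorm(n * 8), n, 8))
names(dat) <- paste0("x", 1:8)
dat$y <- dat$x1 + 0.7 * dat$x2 + 0.3 * dat$x3 + rnorm(n)

# Keep the 5 best models of each size rather than a single winner.
subs <- regsubsets(y ~ ., data = dat, nbest = 5, nvmax = 8)
s <- summary(subs)

# s$which has one row per model and a logical column per term.
inc <- s$which[, -1, drop = FALSE]      # drop the intercept column

# How often each variable appears among the 10 lowest-BIC models.
top <- order(s$bic)[1:10]
freq_top <- colMeans(inc[top, , drop = FALSE])

# Crude weights: exp(-BIC/2), normalised to sum to 1.
w <- exp(-(s$bic - min(s$bic)) / 2)
w <- w / sum(w)

# Weighted inclusion frequency of each variable across all kept models.
incl_weight <- colSums(inc * w)
```

Variables with freq_top or incl_weight near 1 are the features common to
the good models; variables that come and go are the ones a single
"best" model would be unstable about.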


	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



