[R] Stepwise Regression and PLS

Frank E Harrell Jr feh3k at spamcop.net
Mon Feb 2 04:42:31 CET 2004


On Sun, 1 Feb 2004 19:13:49 -0800 (PST)
Jinsong Zhao <jinsong_zh at yahoo.com> wrote:

> 
> --- Frank E Harrell Jr <feh3k at spamcop.net> wrote:
> > On Sun, 1 Feb 2004 11:09:28 -0800 (PST)
> > Jinsong Zhao <jinsong_zh at yahoo.com> wrote:
> > 
> > > Dear all,
> > > 
> > > I am a newcomer to R. I intend to use R to do stepwise
> > > regression and PLS with a data set (a 55x20 matrix, with
> > > one dependent and 19 independent variables). Based on the
> > > same data set, I have done the same work using SPSS and
> > > SAS. However, the results obtained by R differ
> > > substantially from those of SPSS or SAS.
> > > 
> > > In the case of stepwise regression, SPSS gave a model with
> > > 4 independent variables, but with step(), R gave a model
> > > with 10 and a much higher R2. Furthermore, regsubsets()
> > > also indicates that the 10-variable model is one of the
> > > best regression subsets. How can this difference be
> > > explained? And in the case of my data set, how many
> > > variables entering the model would be reasonable?
> > > 
> > > In the case of PLS, the results of the mvr function of the
> > > pls.pcr package also differ from those of SAS. Although
> > > the optimum number of latent variables is the same, the
> > > difference in R2 is large. Why?
> > > 
> > > Any comments and suggestions are much appreciated. Thanks
> > > in advance!
> > > 
> > > Best wishes,
> > > 
> > > Jinsong Zhao
> > > 
> > 
> > In your case SPSS, SAS, R, S-Plus, Stata, Systat, Statistica,
> > and every other package will agree in one sense, because
> > results from all of them will be virtually meaningless.
> > Simulate some data from a known model and you'll quickly find
> > out why stepwise variable selection is often a train wreck.
> > 
> > ---
> > Frank E Harrell Jr   Professor and Chair           School of Medicine
> >                      Department of Biostatistics   Vanderbilt University
> 
> For the case of stepwise regression, I have found that
> the subsets I got using regsubsets() are collinear.
> However, the variables in SPSS's result are not
> collinear. I wonder what I should do to get the same or
> a better linear model.

I think you missed the point.  None of the variable selection procedures
will provide results that have a fair probability of replicating in
another sample.
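
A rough way to check this is the simulation suggested earlier in the
thread: draw several samples from one known model and see whether the
selected variables repeat. The sketch below is only an illustration under
stated assumptions. It is written in Python (standard library only) rather
than R, it uses a simple AIC-based forward selection that is in the spirit
of step() but is not a reimplementation of it, and the model itself (4 real
predictors out of 19, coefficient 0.3, unit-variance noise, independent
predictors) is made up for the example. Only the 55x19 dimensions come from
the question.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def simulate(n, p, n_true, beta, rng):
    """Draw p independent N(0,1) predictors; y depends only on the first n_true."""
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    y = [sum(beta * cols[j][i] for j in range(n_true)) + rng.gauss(0.0, 1.0)
         for i in range(n)]
    return cols, y


def forward_select(cols, y):
    """Forward selection: add the predictor that lowers RSS most, while AIC improves.

    AIC is taken as n*log(RSS/n) + 2*(number of coefficients).
    Candidates are kept orthogonal to the intercept and to the predictors
    already selected, so the RSS drop for x is (resid . x)^2 / (x . x).
    """
    n = len(y)
    resid = center(y)                                   # residual of intercept-only fit
    cand = {j: center(c) for j, c in enumerate(cols)}   # remaining candidates
    rss = dot(resid, resid)
    aic = n * math.log(rss / n) + 2                     # one coefficient: intercept
    selected = []
    while cand:
        drops = {j: dot(resid, x) ** 2 / dot(x, x)
                 for j, x in cand.items() if dot(x, x) > 1e-12}
        if not drops:
            break
        best = max(drops, key=drops.get)
        new_rss = rss - drops[best]
        if new_rss <= 0.0:                              # perfect fit; nothing left to gain
            break
        new_aic = n * math.log(new_rss / n) + 2 * (len(selected) + 2)
        if new_aic >= aic:                              # no AIC improvement: stop
            break
        q = cand.pop(best)
        qq = dot(q, q)
        coef = dot(resid, q) / qq
        resid = [r - coef * a for r, a in zip(resid, q)]
        for j in list(cand):                            # orthogonalize the rest against q
            c = dot(cand[j], q) / qq
            cand[j] = [a - c * b for a, b in zip(cand[j], q)]
        selected.append(best)
        rss, aic = new_rss, new_aic
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical setup: 55 observations, 19 predictors, true predictors are 0..3.
    rng = random.Random(2004)
    for rep in range(5):
        cols, y = simulate(55, 19, 4, 0.3, rng)
        print("sample", rep, "selected:", forward_select(cols, y))
```

Each printed line is the set of predictors chosen for one fresh sample from
the same model. Comparing those sets with each other, and with the true
predictors 0 to 3, gives a direct impression of how well the selection
replicates. Changing the coefficient or the sample size in simulate() shows
how strongly that depends on signal strength and n.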

FH
---
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University



