[R] variable selection in logistic
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Thu Sep 3 19:11:14 CEST 2009
annie Zhang wrote:
> Thank you for all your replies.
> Actually, as Bert said, besides prediction, I also need variable selection
> (I need to know which variables are important). As for the sample
> size and number of variables, both are small, around 35. How can
> I get accurate prediction as well as good predictors?
> Annie
It is next to impossible to find a unique list of 'important' variables
without having 50 times as many subjects as potential predictors, unless
your signal:noise ratio is stunning.
Frank
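
A minimal sketch of why such a list is unstable, not taken from the thread. It simulates 35 subjects and 8 candidate predictors (fewer than the roughly 35 in Annie's data, so that an unpenalized glm() can be fit at all), lets only x1 and x2 affect the outcome, and reruns backward step() on bootstrap resamples. The sample size, effect sizes, seed, number of resamples, and the helper name select_once are all assumptions chosen for illustration.

```r
# Illustrative simulation only; every setting below is an assumption.
set.seed(1)
n <- 35   # subjects
p <- 8    # candidate predictors, kept well below n so glm() can be fit
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))

# Only x1 and x2 truly affect the outcome
y <- rbinom(n, 1, plogis(X[, "x1"] - X[, "x2"]))
dat <- data.frame(y = y, X)

# Hypothetical helper: backward stepwise selection by AIC on one data set,
# returning the names of the variables that survive
select_once <- function(d) {
  old <- options(warn = -1)   # silence separation warnings in small resamples
  on.exit(options(old))
  full <- glm(y ~ ., data = d, family = binomial)
  fit <- step(full, direction = "backward", trace = 0)
  attr(terms(fit), "term.labels")
}

# Repeat the selection on bootstrap resamples of the same data
B <- 100
picked <- replicate(B, select_once(dat[sample(n, replace = TRUE), ]),
                    simplify = FALSE)

# Proportion of resamples in which each variable was selected
freq <- table(factor(unlist(picked), levels = colnames(X))) / B
print(round(freq, 2))

# Number of distinct subsets chosen across the resamples
n_distinct <- length(unique(vapply(picked, paste, "", collapse = "+")))
cat("distinct selected models:", n_distinct, "of", B, "resamples\n")
```

If selection were reliable, nearly every resample would return the same subset. The selection frequencies and the count of distinct models show how far a given run is from that.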
>
> On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter <gunter.berton at gene.com> wrote:
>
> But let's be clear here folks:
>
> Ben's comment is apropos: "'As many variables as samples' is particularly
> scary."
>
> (Aside -- how much scarier then are -omics analyses in which the number of
> variables is thousands of times the number of samples?)
>
> Sensible penalization (it's usually not too sensitive to the details) is
> only another way of obtaining a parsimonious model with good (in the sense
> of minimizing overall prediction error: bias + variance) prediction
> properties. Alas, this is often not what scientists want: they use variable
> selection to find the "right" covariates, the "most important" variables
> affecting the response. But this is beyond the power of empirical modeling
> here: "as many variables as samples" almost guarantees that there will be
> many different and even nonoverlapping subsets of variables that are, within
> statistical noise, equally "optimal" predictors. That is, variable selection
> in such circumstances is just a pretty sophisticated random number generator
> -- ergo Frank's Draconian warnings. Penalization produces better prediction
> engines with better properties, but it cannot overcome the "as many
> variables as samples" problem either. Entropy rules. If what is sought is a
> way to determine the "truly important" variables, then the study must be
> designed to provide the information to do so. You don't get something for
> nothing.
>
> Cheers,
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Frank E Harrell Jr
> Sent: Wednesday, September 02, 2009 9:07 PM
> To: annie Zhang
> Cc: r-help at r-project.org
> Subject: Re: [R] variable selection in logistic
>
> annie Zhang wrote:
> > Hi, Frank,
> >
> > You mean the backward and forward stepwise selection is bad? You also
> > suggest the penalized logistic regression is the best choice? Is there
> > any function to do it as well as selecting the best penalty?
> >
> > Annie
>
> All variable selection is bad unless it's in the context of penalization.
> You'll need penalized logistic regression, not necessarily with
> variable selection, for example a quadratic penalty as in a case study
> in my book, or an L1 penalty (lasso) using other packages.
>
> Frank
>
> >
> > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
> > <f.harrell at vanderbilt.edu> wrote:
> >
> > David Winsemius wrote:
> >
> >
> > On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
> >
> > Hi, R users,
> >
> > What may be the best function in R to do variable selection
> > in logistic regression?
> >
> >
> > PhD theses, and books by famous statisticians have been pursuing
> > the answer to that question for decades.
> >
> > I have the same number of variables as the number of samples,
> > and I want to select the best variables for prediction. Is
> > there any function doing forward selection followed by backward
> > elimination in stepwise logistic regression?
> >
> >
> > You should probably be reading up on penalized regression
> > methods. The stepwise procedures reporting unadjusted
> > "significance" made available by SAS and SPSS to the unwary
> > neophyte user have very poor statistical properties.
> >
> > --
> >
> > David Winsemius, MD
> >
> >
> > Amen to that.
> >
> > Annie, resist the temptation. These methods bite.
> >
> > Frank
> >
> >
> > Heritage Laboratories
> > West Hartford, CT
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
> >
> > --
> > Frank E Harrell Jr Professor and Chair School of Medicine
> > Department of Biostatistics Vanderbilt University
> >
> >
>
>
> --
> Frank E Harrell Jr Professor and Chair School of Medicine
> Department of Biostatistics Vanderbilt University
>
>
>
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University