[R] variable selection in logistic
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Thu Sep 3 22:45:09 CEST 2009
You'll need to do a huge amount of background reading first. These
stepwise options do not incorporate penalization.
Frank
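[Editor's note: for readers landing on this archived thread, the penalized approach Frank contrasts with stepwise selection can be sketched as below. This is an illustration on simulated data only; it assumes the Design package (whose successor is rms) and its lrm() and pentrace() functions, and uses smaller hypothetical dimensions than Annie's 35x35 problem, since with as many predictors as samples an unpenalized fit may not even converge.]

```r
## Illustration only, on simulated data: quadratic (ridge-type)
## penalized logistic regression with the Design package (now rms).
library(Design)                        # library(rms) in current R

set.seed(1)
n <- 100; p <- 10                      # hypothetical sizes
x <- matrix(rnorm(n * p), nrow = n)
y <- rbinom(n, 1, plogis(x[, 1]))
d <- data.frame(y = y, x)

f <- lrm(y ~ ., data = d, x = TRUE, y = TRUE)

## pentrace() compares candidate penalty strengths by effective AIC;
## refit with the best one it reports
p.best <- pentrace(f, penalty = c(0, 1, 2, 4, 8, 16, 32))
f.pen  <- update(f, penalty = p.best$penalty)
```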
annie Zhang wrote:
> Hi, Frank,
>
> If I want to do prediction as well as to select important predictors,
> which may be the best function to use when I have 35 samples and 35
> predictors (penalized logistic with variable selection)? I saw there is
> a 'fastbw' function in the Design package. And there is a 'step.plr'
> function in the 'stepPlr' package.
>
> Thank you,
>
> Annie
>
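[Editor's note: the step.plr() route Annie mentions combines an L2 penalty with forward/backward selection. A minimal sketch on simulated data follows; the argument names reflect my reading of the stepPlr package and should be checked against help(step.plr) before use.]

```r
## Illustration only: penalized logistic regression with stepwise
## selection via step.plr() from the stepPlr package. Arguments
## (lambda, cp, max.terms) are assumptions -- verify in the docs.
library(stepPlr)

set.seed(1)
x <- matrix(rnorm(35 * 35), nrow = 35)   # 35 samples, 35 predictors
y <- rbinom(35, 1, 0.5)

fit <- step.plr(x, y, lambda = 1e-4, cp = "bic", max.terms = 5)
```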
> On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr
> <f.harrell at vanderbilt.edu> wrote:
>
> annie Zhang wrote:
>
> Thank you for all your replies.
> Actually, as Bert said, besides prediction I also need variable
> selection (I need to know which variables are important). Both the
> sample size and the number of variables are small, around 35. How can
> I get accurate prediction as well as good predictors?
> Annie
>
>
> It is next to impossible to find a unique list of 'important'
> variables without having 50 times as many subjects as potential
> predictors, unless your signal:noise ratio is stunning.
>
> Frank
>
>
> On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter
> <gunter.berton at gene.com> wrote:
>
> But let's be clear here folks:
>
> Ben's comment is apropos: '"As many variables as samples" is
> particularly scary.'
>
> (Aside -- how much scarier then are -omics analyses in which the
> number of variables is thousands of times the number of samples?)
>
> Sensible penalization (it's usually not too sensitive to the details)
> is only another way of obtaining a parsimonious model with good
> prediction properties, in the sense of minimizing overall prediction
> error: bias + variance. Alas, this is often not what scientists want:
> they use variable selection to find the "right" covariates, the "most
> important" variables affecting the response. But this is beyond the
> power of empirical modeling here: "as many variables as samples"
> almost guarantees that there will be many different and even
> nonoverlapping subsets of variables that are, within statistical
> noise, equally "optimal" predictors. That is, variable selection in
> such circumstances is just a pretty sophisticated random number
> generator -- ergo Frank's Draconian warnings. Penalization produces
> prediction engines with better properties, but it cannot overcome the
> "as many variables as samples" problem either. Entropy rules. If what
> is sought is a way to determine the "truly important" variables, then
> the study must be designed to provide the information to do so. You
> don't get something for nothing.
>
> Cheers,
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Frank E Harrell Jr
> Sent: Wednesday, September 02, 2009 9:07 PM
> To: annie Zhang
> Cc: r-help at r-project.org
> Subject: Re: [R] variable selection in logistic
>
> annie Zhang wrote:
> > Hi, Frank,
> >
> > You mean the backward and forward stepwise selection is bad? You
> > also suggest the penalized logistic regression is the best choice?
> > Is there any function to do it as well as selecting the best
> > penalty?
> >
> > Annie
>
> All variable selection is bad unless it's done in the context of
> penalization. You'll need penalized logistic regression, though not
> necessarily with variable selection: for example, a quadratic penalty
> as in a case study in my book, or an L1 penalty (lasso) using other
> packages.
>
> Frank
>
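[Editor's note: the L1 (lasso) route Frank mentions above, which does zero out coefficients and hence performs a form of variable selection, can be sketched with the glmnet package. Simulated data only; the penalty lambda is chosen by cross-validation.]

```r
## Illustration only, on simulated data: L1-penalized (lasso) logistic
## regression with glmnet; cross-validation chooses the penalty lambda
library(glmnet)

set.seed(1)
x <- matrix(rnorm(35 * 35), nrow = 35)   # 35 samples, 35 predictors
y <- rbinom(35, 1, 0.5)

cv <- cv.glmnet(x, y, family = "binomial", nfolds = 5)

## coefficients at the CV-chosen lambda: nonzero rows are the
## "selected" variables -- but heed the warnings in this thread about
## the instability of any such selection when p is close to n
coef(cv, s = "lambda.min")
```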
> >
> > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
> > <f.harrell at vanderbilt.edu> wrote:
> >
> > David Winsemius wrote:
> >
> >
> > On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
> >
> > > Hi, R users,
> > >
> > > What may be the best function in R to do variable selection in
> > > logistic regression?
> >
> > PhD theses and books by famous statisticians have been pursuing the
> > answer to that question for decades.
> >
> > > I have the same number of variables as the number of samples, and
> > > I want to select the best variables for prediction. Is there any
> > > function doing forward selection followed by backward elimination
> > > in stepwise logistic regression?
> >
> > You should probably be reading up on penalized regression methods.
> > The stepwise procedures reporting unadjusted "significance" made
> > available by SAS and SPSS to the unwary neophyte user have very
> > poor statistical properties.
> >
> > --
> > David Winsemius, MD
> > Heritage Laboratories
> > West Hartford, CT
> >
> > Amen to that.
> >
> > Annie, resist the temptation. These methods bite.
> >
> > Frank
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
> >
> > --
> > Frank E Harrell Jr   Professor and Chair           School of Medicine
> >                      Department of Biostatistics   Vanderbilt University
> >
> >
>
>
> --
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>
>
>
>
>
> --
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>
>
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University