[R] variable selection in logistic

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Sep 3 22:45:09 CEST 2009


You'll need to do a huge amount of background reading first.  These 
stepwise options do not incorporate penalization.

Frank
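
To make the quadratic penalty mentioned further down the thread concrete, here is a minimal base-R sketch of ridge-penalized logistic regression on simulated data of the same shape (35 samples, 35 predictors). The helper name ridge_logistic, the simulated data, and the two lambda values are illustrative assumptions and are not taken from the Design or stepPlr packages. In real use the penalty would be chosen by cross-validation or a similar criterion rather than fixed by hand.

```r
# Ridge (quadratic-penalty) logistic regression using base R only.
# ridge_logistic() is a hypothetical helper written for this illustration.
ridge_logistic <- function(X, y, lambda) {
  Xs  <- cbind(1, scale(X))              # intercept + standardized predictors
  pen <- c(0, rep(lambda, ncol(X)))      # leave the intercept unpenalized

  # Penalized negative log-likelihood: -loglik + 0.5 * sum(lambda * beta^2)
  nll <- function(b) {
    eta <- drop(Xs %*% b)
    -sum(y * plogis(eta, log.p = TRUE) +
         (1 - y) * plogis(-eta, log.p = TRUE)) + 0.5 * sum(pen * b^2)
  }
  # Analytic gradient of the penalized negative log-likelihood
  grad <- function(b) {
    eta <- drop(Xs %*% b)
    -drop(crossprod(Xs, y - plogis(eta))) + pen * b
  }

  fit <- optim(rep(0, ncol(Xs)), nll, grad, method = "BFGS",
               control = list(maxit = 1000))
  setNames(fit$par, c("(Intercept)", colnames(X)))
}

# Simulated data (assumption): only x1 and x2 carry any signal.
set.seed(1)
n <- 35; p <- 35
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

b_weak   <- ridge_logistic(X, y, lambda = 1)
b_strong <- ridge_logistic(X, y, lambda = 50)

# A heavier penalty shrinks the slopes toward zero; none is set exactly to zero.
c(weak = sum(b_weak[-1]^2), strong = sum(b_strong[-1]^2))
```

Note that the quadratic penalty keeps every predictor in the model and only shrinks the coefficients, which is why it gives a prediction engine rather than a list of "important" variables.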

annie Zhang wrote:
> Hi, Frank,
>  
> If I want to do prediction as well as to select important predictors, 
> which may be the best function to use when I have 35 samples and 35 
> predictors (penalized logistic with variable selection)? I saw there is 
> a 'fastbw' function in the Design package. And there is a 'step.plr' 
> function in the 'stepPlr' package.
>  
> Thank you,
>  
> Annie
> 
> On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr 
> <f.harrell at vanderbilt.edu> wrote:
> 
>     annie Zhang wrote:
> 
>         Thank you for all your replies.
>         Actually, as Bert said, besides prediction, I also need variable
>         selection (I need to know which variables are important). As for
>         the sample size and number of variables, both are small, around
>         35. How can I get accurate prediction as well as good
>         predictors?
>         Annie
> 
> 
>     It is next to impossible to find a unique list of 'important'
>     variables without having 50 times as many subjects as potential
>     predictors, unless your signal:noise ratio is stunning.
> 
>     Frank
> 
> 
>         On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter
>         <gunter.berton at gene.com> wrote:
> 
>            But let's be clear here folks:
> 
>            Ben's comment is apropos: ""As many variables as samples" is
>            particularly scary."
> 
>            (Aside -- how much scarier then are -omics analyses in which
>            the number of variables is thousands of times the number of
>            samples?)
> 
>            Sensible penalization (it's usually not too sensitive to the
>            details) is only another way of obtaining a parsimonious model
>            with good (in the sense of minimizing overall prediction error:
>            bias + variance) prediction properties. Alas, this is often not
>            what scientists want: they use variable selection to find the
>            "right" covariates, the "most important" variables affecting
>            the response. But this is beyond the power of empirical
>            modeling here: "as many variables as samples" almost guarantees
>            that there will be many different and even nonoverlapping
>            subsets of variables that are, within statistical noise,
>            equally "optimal" predictors. That is, variable selection in
>            such circumstances is just a pretty sophisticated random number
>            generator -- ergo Frank's Draconian warnings. Penalization
>            produces better prediction engines with better properties, but
>            it cannot overcome the "as many variables as samples" problem
>            either. Entropy rules. If what is sought is a way to determine
>            the "truly important" variables, then the study must be
>            designed to provide the information to do so. You don't get
>            something for nothing.
> 
>            Cheers,
> 
>            Bert Gunter
>            Genentech Nonclinical Biostatistics
> 
> 
>            -----Original Message-----
>            From: r-help-bounces at r-project.org
>            [mailto:r-help-bounces at r-project.org] On Behalf Of
>            Frank E Harrell Jr
>            Sent: Wednesday, September 02, 2009 9:07 PM
>            To: annie Zhang
>            Cc: r-help at r-project.org
>            Subject: Re: [R] variable selection in logistic
> 
>            annie Zhang wrote:
>             > Hi, Frank,
>             >
>             > You mean the backward and forward stepwise selection is
>             > bad? You also suggest the penalized logistic regression is
>             > the best choice? Is there any function to do it as well as
>             > selecting the best penalty?
>             >
>             > Annie
> 
>            All variable selection is bad unless it's in the context of
>            penalization.  You'll need penalized logistic regression, not
>            necessarily with variable selection, for example a quadratic
>            penalty as in a case study in my book, or an L1 penalty (lasso)
>            using other packages.
> 
>            Frank
> 
>             >
>             > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
>             > <f.harrell at vanderbilt.edu> wrote:
>             >
>             >     David Winsemius wrote:
>             >
>             >
>             >         On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>             >
>             >             Hi, R users,
>             >
>             >             What may be the best function in R to do variable
>             >             selection in logistic regression?
>             >
>             >
>             >         PhD theses, and books by famous statisticians have been
>             >         pursuing the answer to that question for decades.
>             >
>             >             I have the same number of variables as the number
>             >             of samples, and I want to select the best variables
>             >             for prediction. Is there any function doing forward
>             >             selection followed by backward elimination in
>             >             stepwise logistic regression?
>             >
>             >
>             >         You should probably be reading up on penalized
>             >         regression methods. The stepwise procedures reporting
>             >         unadjusted "significance" made available by SAS and SPSS
>             >         to the unwary neophyte user have very poor statistical
>             >         properties.
>             >
>             >         --
>             >
>             >         David Winsemius, MD
>             >
>             >
>             >     Amen to that.
>             >
>             >     Annie, resist the temptation.  These methods bite.
>             >
>             >     Frank
>             >
>             >
>             >         Heritage Laboratories
>             >         West Hartford, CT
>             >
>             >         ______________________________________________
>             >         R-help at r-project.org mailing list
>             >         https://stat.ethz.ch/mailman/listinfo/r-help
>             >         PLEASE do read the posting guide
>             >         http://www.R-project.org/posting-guide.html
>             >         and provide commented, minimal, self-contained,
>             >         reproducible code.
>             >
>             >
>             >
>             >     --
>             >     Frank E Harrell Jr   Professor and Chair           School of Medicine
>             >                          Department of Biostatistics   Vanderbilt University
>             >
>             >
> 
> 
> 
> 
> 
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
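
Bert's "random number generator" point quoted above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions, not anyone's recommended method: the predictors are pure noise, a crude univariate screen (top k by absolute correlation with the outcome) stands in for stepwise selection, and bootstrap resampling shows how often each variable gets picked. The names top_k, k, and B are made up for the example.

```r
# Selection instability with as many predictors as samples (base R only).
set.seed(2009)
n <- 35; p <- 35; k <- 5
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, 0.5)                 # outcome unrelated to every predictor

# Hypothetical screening rule: keep the k predictors most correlated with y.
top_k <- function(X, y, k) {
  score <- abs(cor(X, y))
  order(score, decreasing = TRUE)[1:k]
}

# Repeat the "selection" on B bootstrap resamples of the same data set.
B <- 200
picks <- replicate(B, {
  i <- sample(n, replace = TRUE)
  top_k(X[i, ], y[i], k)
})

# Proportion of resamples in which each variable was selected.
freq <- tabulate(picks, nbins = p) / B
names(freq) <- colnames(X)
round(sort(freq, decreasing = TRUE), 2)
```

If selection were stable, a handful of variables would have a frequency near 1 and the rest near 0; how far the result departs from that pattern is a rough measure of how arbitrary the "selected" list is.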



