[R] Pre-model Variable Reduction

Frank E Harrell Jr f.harrell at vanderbilt.edu
Tue Dec 9 18:21:07 CET 2008


Mark Difford wrote:
> Hi All,
> 
> I beg to differ with Ravi Varadhan's perspective. While it is true that
> principal component analysis does not itself do variable selection, it is an
> important method for pointing the way to what to select. This is what the
> methods in the subselect package rely on. (One of its authors was, I believe,
> a student of Jolliffe's.) For a modern perspective on this, see the
> following paper:
> 
> Debashis Paul, Eric Bair, Trevor Hastie and Robert Tibshirani:
> "Preconditioning" for feature selection and regression in high-dimensional
> problems. Annals of Statistics 36(4), 2008, 1595-1618. The authors show
> that supervised principal components followed by a variable selection
> procedure is an effective approach for variable selection in very high
> dimension.
> 
> http://www-stat.stanford.edu/~hastie/Papers/Preconditioning_Annals.pdf
> 
> Regards, Mark.

Mark,

Slightly more relevant are the unsupervised sparse principal component 
methods described in the following references.  If anyone knows of 
better references for this, please let me know.  -Frank


@Article{zou06spa,
   author =  {Zou, Hui and Hastie, Trevor and Tibshirani, Robert},
   title =   {Sparse principal component analysis},
   journal = {J Comp Graph Stat},
   year =    2006,
   volume =  15,
   pages =   {265-286},
   annote =  {gene microarray;lasso/elastic net;multivariate
analysis;data reduction;singular value decomposition;thresholding;principal
components analysis that shrinks some loadings to zero}
}
@Article{wit08tes,
   author =  {Witten, Daniela M. and Tibshirani, Robert},
   title =   {Testing significance of features by lassoed principal components},
   journal = {Annals Appl Stat},
   year =    2008,
   volume =  2,
   number =  3,
   pages =   {986-1012},
   annote =  {reduction in false discovery rates over using a vector of 
t-statistics;borrowing strength across genes;``one would not expect a 
single gene to be associated with the outcome, since, in practice, many 
genes work together to effect a particular phenotype.  LPC effectively 
down-weights individual genes that are associated with the outcome but 
that do not share an expression pattern with a larger group of genes, 
and instead favors large groups of genes that appear to be 
differentially-expressed.'';regress principal components on outcome}
}
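
For readers who want to see what "shrinks some loadings to zero" looks like
in practice, here is a minimal base-R sketch. It is not the elastic-net
algorithm of the references above; it swaps in a simple rank-one
soft-thresholded power iteration. The simulated data, the function names
soft() and sparse_pc1(), and the tuning value frac = 0.5 are all assumptions
made for illustration.

```r
# Simulated data (hypothetical): x1-x3 share a latent factor, x4-x6 are noise.
set.seed(1)
n <- 200
z <- rnorm(n)
X <- cbind(z + rnorm(n, sd = 0.3), z + rnorm(n, sd = 0.3),
           z + rnorm(n, sd = 0.3), rnorm(n), rnorm(n), rnorm(n))
colnames(X) <- paste0("x", 1:6)
X <- scale(X)  # center and scale, so no Y is needed and units do not matter

# Soft-thresholding operator: shrink toward zero, set small values to exactly 0.
soft <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

# First sparse loading vector by alternating between scores (u) and
# thresholded loadings (v). 'frac' sets the threshold as a fraction of the
# largest absolute entry, so at least one loading always survives.
sparse_pc1 <- function(X, frac = 0.5, iters = 50) {
  v <- svd(X)$v[, 1]                      # start from the ordinary first PC
  for (i in seq_len(iters)) {
    u <- X %*% v
    u <- u / sqrt(sum(u^2))               # unit-length score vector
    a <- drop(crossprod(X, u))            # unthresholded loadings
    v <- soft(a, frac * max(abs(a)))      # shrink; small loadings become 0
    v <- v / sqrt(sum(v^2))               # renormalize to unit length
  }
  setNames(v, colnames(X))
}

loadings1 <- sparse_pc1(X, frac = 0.5)
print(round(loadings1, 3))
print(names(loadings1)[loadings1 != 0])   # variables retained by the component
```

Variables whose loading is exactly zero drop out of the component, which is
how sparse PCA can double as unsupervised variable reduction. For real work
a dedicated implementation (for example spca() in the elasticnet package) is
likely the better choice.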

> 
> 
> Ravi Varadhan wrote:
>> Principal components analysis does "dimensionality reduction" but NOT
>> "variable reduction".  However, Jolliffe's 2004 book on PCA does discuss
>> the problem of selecting a subset of variables, with the goal of
>> representing the internal variation of the original multivariate vector as
>> well as possible (see Section 6.3 of that book).  I do not think that these
>> methods can handle missing data.  The most important issue is to think
>> about the goal of variable reduction and then choose an appropriate
>> optimality criterion for achieving that goal.  In most instances of
>> variable selection, the criterion that is optimized is never explicitly
>> considered.
>>
>> Ravi.
>>
>> -----------------------------------------------------------------------
>> Ravi Varadhan, Ph.D.
>> Assistant Professor, The Center on Aging and Health
>> Division of Geriatric Medicine and Gerontology
>> Johns Hopkins University
>> Ph: (410) 502-2619
>> Fax: (410) 614-9625
>> Email: rvaradhan at jhmi.edu
>> Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
>> -----------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of Gabor Grothendieck
>> Sent: Tuesday, December 09, 2008 8:00 AM
>> To: Harsh
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Pre-model Variable Reduction
>>
>> See:
>>
>> ?prcomp
>> ?princomp
>>
>> On Tue, Dec 9, 2008 at 5:34 AM, Harsh <singhalblr at gmail.com> wrote:
>>> Hello All,
>>> I am trying to carry out variable reduction. I do not have information 
>>> about the dependent variable, and have only the X variables as it 
>>> were.
>>> In selecting variables I wish to keep, I have considered the following
>>> criteria.
>>> 1) Percentage of missing values in each column/variable
>>> 2) Variance of each variable, with a cut-off value.
>>>
>>> I recently came across Weka and found that there is an RWeka package 
>>> which would allow me to make use of Weka through R.
>>> Weka provides a "Genetic search" variable reduction method, but I 
>>> could not find its R implementation in the RWeka PDF manual on 
>>> CRAN.
>>>
>>> I looked for other R packages that allow me to do variable reduction 
>>> without considering a dependent variable. I came across 'dprep'
>>> package but it does not have a Windows implementation.
>>>
>>> Moreover, I have a dataset that contains continuous and categorical 
>>> variables, with some categorical variables having 3 levels, 10 levels 
>>> and so on, up to a maximum of 50 levels (e.g., states in the USA).
>>>
>>> Any suggestions in this regard will be much appreciated.
>>>
>>> Thank you
>>>
>>> Harsh Singhal
>>> Decision Systems,
>>> Mu Sigma, Inc.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
> 
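
The two screening criteria in Harsh's original question (fraction missing
per column and a variance cut-off), followed by the prcomp() call Gabor
pointed to, can be sketched roughly as follows. The toy data frame, its
column names, and the cut-offs max_missing = 0.5 and min_var = 0.05 are
hypothetical. A raw variance cut-off also depends on each variable's units,
so treat this as an illustration rather than a recommendation.

```r
# Toy data (hypothetical): 'b' has almost no variance, 'c' is mostly missing.
set.seed(42)
n <- 100
dat <- data.frame(
  a = rnorm(n),
  b = rnorm(n, sd = 0.01),
  c = rnorm(n),
  d = rnorm(n),
  state = factor(sample(c("TN", "MD", "CA"), n, replace = TRUE))
)
dat$c[sample(n, 60)] <- NA                 # make 60% of 'c' missing

max_missing <- 0.5    # assumed cut-off: drop columns with > 50% missing
min_var     <- 0.05   # assumed cut-off: drop numeric columns below this variance

# Criterion 1: fraction of missing values per column.
miss_frac <- colMeans(is.na(dat))
keep <- names(dat)[miss_frac <= max_missing]

# Criterion 2: variance of each remaining numeric column (factors are skipped).
is_num  <- vapply(dat[keep], is.numeric, logical(1))
num_var <- vapply(dat[keep][is_num], var, numeric(1), na.rm = TRUE)
keep    <- setdiff(keep, names(num_var)[num_var < min_var])

reduced <- dat[keep]
print(keep)

# PCA on the surviving numeric columns only; prcomp() on a matrix or data
# frame does not accept NAs, so incomplete rows are dropped first.
num <- reduced[vapply(reduced, is.numeric, logical(1))]
num <- num[complete.cases(num), , drop = FALSE]
pc  <- prcomp(num, center = TRUE, scale. = TRUE)
print(summary(pc))
```

Factor columns such as state pass through the screen untouched here; prcomp()
needs numeric input, so categorical variables with many levels would have to
be handled separately.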


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University


