[R] Pre-model Variable Reduction

Tue Dec 9 18:32:42 CET 2008

Hi Frank,

>> If anyone knows of better references for this please let me know.

Many thanks: I was not aware of the Witten paper. If I turn up anything else
I will be sure to let you know.

Best Regards, Mark.

Frank E Harrell Jr wrote:
> 
> Mark Difford wrote:
>> Hi All,
>> 
>> I beg to differ with Ravi Varadhan's perspective. While it is true that
>> principal component analysis does not itself do variable selection, it is
>> an important method for pointing the way to what to select. This is what
>> the methods in the subselect package rely on. (One of its authors was, I
>> believe, a student of Jolliffe's.) For a modern perspective on this, see
>> the following paper:
>> 
>> Debashis Paul, Eric Bair, Trevor Hastie and Robert Tibshirani:
>> "Preconditioning" for feature selection and regression in
>> high-dimensional problems. We show that supervised principal components
>> followed by a variable selection procedure is an effective approach for
>> variable selection in very high dimension.
>> Annals of Statistics 36(4), 2008, 1595-1618.
>> 
>> http://www-stat.stanford.edu/~hastie/Papers/Preconditioning_Annals.pdf
>> 
>> Regards, Mark.
> 
> Mark,
> 
> Slightly more relevant are the unsupervised sparse principal component 
> methods described in the following references.  If anyone knows of 
> better references for this please let me know.  -Frank
> 
> 
> @Article{zou06spa,
>    author = 		 {Zou, Hui and Hastie, Trevor and Tibshirani, Robert},
>    title = 		 {Sparse principal component analysis},
>    journal = 	 {J Comp Graph Stat},
>    year = 		 2006,
>    volume =		 15,
>    pages =		 {265-286},
>    annote =		 {gene microarray;lasso/elastic net;multivariate
> analysis;data reduction;singular value
> decomposition;thresholding;principal components analysis that shrinks
> some loadings to zero}
> }
> @Article{wit08tes,
>    author = 		 {Witten, Daniela M. and Tibshirani, Robert},
>    title = 		 {Testing significance of features by lassoed principal 
> components},
>    journal = 	 {Annals Appl Stat},
>    year = 		 2008,
>    volume = 	 2,
>    number = 	 3,
>    pages = 	 {986-1012},
>    annote = 	 {reduction in false discovery rates over using a vector of 
> t-statistics;borrowing strength across genes;``one would not expect a 
> single gene to be associated with the outcome, since, in practice, many 
> genes work together to effect a particular phenotype.  LPC effectively 
> down-weights individual genes that are associated with the outcome but 
> that do not share an expression pattern with a larger group of genes, 
> and instead favors large groups of genes that appear to be 
> differentially-expressed.'';regress principal components on outcome}
> }
> 
>> 
>> 
>> Ravi Varadhan wrote:
>>> Principal components analysis does "dimensionality reduction" but NOT
>>> "variable reduction".  However, Jolliffe's 2004 book on PCA does discuss
>>> the problem of selecting a subset of variables, with the goal of
>>> representing the internal variation of the original multivariate vector
>>> as well as possible (see Section 6.3 of that book).  I do not think that
>>> these methods can handle missing data.  The most important issue is to
>>> think about the goal of variable reduction and then choose an appropriate
>>> optimality criterion for achieving that goal.  In most instances of
>>> variable selection, the criterion that is optimized is never explicitly
>>> considered.
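
To make the idea of PCA-guided subset selection concrete, here is a minimal sketch of one loading-based heuristic: for each of the first k components, keep the variable with the largest absolute loading that has not already been kept. The helper name `select_by_loadings` and the choice of k are assumptions made for illustration. This is not claimed to be the exact procedure of Jolliffe's Section 6.3, nor what the subselect package implements.

```r
# Hypothetical helper (illustration only): pick one variable per component,
# using the largest absolute loading among variables not yet chosen.
select_by_loadings <- function(x, k) {
  stopifnot(k >= 1, k <= ncol(x))
  # Scale so that variables measured in different units are comparable
  pca <- prcomp(x, center = TRUE, scale. = TRUE)
  load <- abs(pca$rotation[, seq_len(k), drop = FALSE])
  chosen <- character(0)
  for (j in seq_len(k)) {
    remaining <- setdiff(rownames(load), chosen)
    best <- remaining[which.max(load[remaining, j])]
    chosen <- c(chosen, best)
  }
  chosen
}

# USArrests ships with R and has four numeric columns
selected <- select_by_loadings(USArrests, k = 2)
selected
```

Note that the criterion being optimized here is implicit (largest loading per component), which is exactly the kind of choice worth making explicit before relying on the result.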
>>>
>>> Ravi.
>>>
>>> -----------------------------------------------------------------------
>>> Ravi Varadhan, Ph.D.
>>> Assistant Professor, The Center on Aging and Health
>>> Division of Geriatric Medicine and Gerontology
>>> Johns Hopkins University
>>> Ph: (410) 502-2619
>>> Fax: (410) 614-9625
>>> Email: rvaradhan at jhmi.edu
>>> Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
>>> -----------------------------------------------------------------------
>>>
>>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>> On Behalf Of Gabor Grothendieck
>>> Sent: Tuesday, December 09, 2008 8:00 AM
>>> To: Harsh
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] Pre-model Variable Reduction
>>>
>>> See:
>>>
>>> ?prcomp
>>> ?princomp
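
For readers following those two help pages, a minimal sketch of a `prcomp` call on a built-in dataset may be useful. It assumes a complete, all-numeric data frame; columns with missing values or factors (such as a 50-level state variable) would need to be handled before this step. The object names are arbitrary.

```r
# USArrests ships with R; all four columns are numeric and complete
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 3)

# Loadings: rows are original variables, columns are components
pca$rotation

# Scores: the observations expressed in the first two components
head(pca$x[, 1:2])
```

The components are linear combinations of all input variables, so this reduces dimension without dropping any column from the data collection step.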
>>>
>>> On Tue, Dec 9, 2008 at 5:34 AM, Harsh <singhalblr at gmail.com> wrote:
>>>> Hello All,
>>>> I am trying to carry out variable reduction. I do not have information 
>>>> about the dependent variable, and have only the X variables as it 
>>>> were.
>>>> In selecting variables I wish to keep, I have considered the following
>>>> criteria.
>>>> 1) Percentage of missing value in each column/variable
>>>> 2) Variance of each variable, with a cut-off value.
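
The two criteria above can be sketched in base R as follows. The helper name `filter_columns` and both cut-off values are assumptions chosen for illustration, not recommendations. Variance depends on the measurement scale, so a single cut-off only makes sense for comparably scaled variables. Non-numeric columns are judged on missingness alone in this sketch.

```r
# Hypothetical helper (illustration only): drop columns with too many
# missing values or with (near-)zero variance.
filter_columns <- function(x, max_missing = 0.2, min_var = 1e-8) {
  stopifnot(is.data.frame(x))
  # Fraction of missing values in each column
  miss_frac <- vapply(x, function(col) mean(is.na(col)), numeric(1))
  # Variance of numeric columns; NA for everything else
  col_var <- vapply(x, function(col) {
    if (is.numeric(col)) var(col, na.rm = TRUE) else NA_real_
  }, numeric(1))
  keep <- miss_frac <= max_missing & (is.na(col_var) | col_var >= min_var)
  x[, keep, drop = FALSE]
}

set.seed(1)
d <- data.frame(
  a = rnorm(100),                                         # ordinary numeric
  b = rep(5, 100),                                        # constant column
  c = c(rnorm(40), rep(NA, 60)),                          # 60% missing
  state = factor(sample(state.abb, 100, replace = TRUE))  # categorical
)
kept <- filter_columns(d)
names(kept)
```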
>>>>
>>>> I recently came across Weka and found that there is an RWeka package 
>>>> which would allow me to make use of Weka through R.
>>>> Weka provides a "Genetic search" variable reduction method, but I 
>>>> could not find its R code implementation in the RWeka Pdf file on 
>>>> CRAN.
>>>>
>>>> I looked for other R packages that allow me to do variable reduction 
>>>> without considering a dependent variable. I came across 'dprep'
>>>> package but it does not have a Windows implementation.
>>>>
>>>> Moreover, I have a dataset that contains continuous and categorical 
>>>> variables, some categorical variables having 3 levels, 10 levels and 
>>>> so on, till a max 50 levels (E.g. States in the USA).
>>>>
>>>> Any suggestions in this regard will be much appreciated.
>>>>
>>>> Thank you
>>>>
>>>> Harsh Singhal
>>>> Decision Systems,
>>>> Mu Sigma, Inc.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide 
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>> 
> 
> 
> -- 
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                       Department of Biostatistics   Vanderbilt University
> 

-- 
View this message in context: http://www.nabble.com/Pre-model-Variable-Reduction-tp20912229p20919501.html
Sent from the R help mailing list archive at Nabble.com.