[R] Pre-model Variable Reduction
Mark Difford
mark_difford at yahoo.co.uk
Tue Dec 9 18:32:42 CET 2008
Hi Frank,
>> If anyone knows of better references for this please let me know.
Many thanks: I was not aware of the Witten paper. If I turn up anything else
I will be sure to let you know.
Best Regards, Mark.
Frank E Harrell Jr wrote:
>
> Mark Difford wrote:
>> Hi All,
>>
>> I beg to differ with Ravi Varadhan's perspective. While it is true that
>> principal component analysis does not itself do variable selection, it is
>> an important method for pointing the way to what to select. This is what
>> the methods in the subselect package rely on. (One of its authors was, I
>> believe, a student of Jolliffe's.) For a modern perspective on this, see
>> the following paper:
>>
>> Debashis Paul, Eric Bair, Trevor Hastie and Robert Tibshirani:
>> "Preconditioning" for feature selection and regression in
>> high-dimensional problems. We show that supervised principal components
>> followed by a variable selection procedure is an effective approach for
>> variable selection in very high dimension. Annals of Statistics 36(4),
>> 2008, 1595-1618.
>>
>> http://www-stat.stanford.edu/~hastie/Papers/Preconditioning_Annals.pdf
>>
>> Regards, Mark.
>
> Mark,
>
> Slightly more relevant are the unsupervised sparse principal component
> methods described in the following references. If anyone knows of
> better references for this please let me know. -Frank
>
>
> @Article{zou06spa,
> author = {Zou, Hui and Hastie, Trevor and Tibshirani, Robert},
> title = {Sparse principal component analysis},
> journal = J Comp Graph Stat,
> year = 2006,
> volume = 15,
> pages = {265-286},
> annote = {gene microarray;lasso/elastic net;multivariate
> analysis;data reduction;singular value
> decomposition;thresholding;principal components analysis that shrinks
> some loadings to zero}
> }
> @Article{wit08tes,
> author = {Witten, Daniela M. and Tibshirani, Robert},
> title = {Testing significance of features by lassoed principal
> components},
> journal = Annals Appl Stat,
> year = 2008,
> volume = 2,
> number = 3,
> pages = {986-1012},
> annote = {reduction in false discovery rates over using a vector of
> t-statistics;borrowing strength across genes;``one would not expect a
> single gene to be associated with the outcome, since, in practice, many
> genes work together to effect a particular phenotype. LPC effectively
> down-weights individual genes that are associated with the outcome but
> that do not share an expression pattern with a larger group of genes,
> and instead favors large groups of genes that appear to be
> differentially-expressed.'';regress principal components on outcome}
> }
>
>>
>>
>> Ravi Varadhan wrote:
>>> Principal components analysis does "dimensionality reduction" but NOT
>>> "variable reduction". However, Jolliffe's 2004 book on PCA does discuss
>>> the problem of selecting a subset of variables, with the goal of
>>> representing the internal variation of the original multivariate vector
>>> as well as possible (see Section 6.3 of that book). I do not think that
>>> these methods can handle missing data. The most important issue is to
>>> think about the goal of variable reduction and then choose an appropriate
>>> optimality criterion for achieving that goal. In most instances of
>>> variable selection, the criterion that is optimized is never explicitly
>>> considered.
>>>
>>> Ravi.
>>>
>>> ----------------------------------------------------------------------------
>>> Ravi Varadhan, Ph.D.
>>> Assistant Professor, The Center on Aging and Health
>>> Division of Geriatric Medicine and Gerontology
>>> Johns Hopkins University
>>> Ph: (410) 502-2619
>>> Fax: (410) 614-9625
>>> Email: rvaradhan at jhmi.edu
>>> Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
>>> ----------------------------------------------------------------------------
>>>
>>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>>   On Behalf Of Gabor Grothendieck
>>> Sent: Tuesday, December 09, 2008 8:00 AM
>>> To: Harsh
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] Pre-model Variable Reduction
>>>
>>> See:
>>>
>>> ?prcomp
>>> ?princomp
>>>
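For reference, a minimal sketch of what prcomp() returns. It uses the built-in USArrests data purely as a stand-in (it is not the original poster's data). Every component loads on all of the original variables, which is the sense in which PCA reduces dimension rather than the number of variables.

```r
# Illustrative data only: built-in USArrests (4 numeric columns).
X <- USArrests

# Centre and scale so that no variable dominates because of its units.
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component.
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
print(round(var_explained, 3))

# Loadings: each component is a weighted combination of ALL the variables,
# so keeping the first k components does not by itself drop any variable.
print(round(pc$rotation, 3))
```

How many components to keep, and how to turn loadings into a choice of variables, is a separate decision that this sketch does not make.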
>>> On Tue, Dec 9, 2008 at 5:34 AM, Harsh <singhalblr at gmail.com> wrote:
>>>> Hello All,
>>>> I am trying to carry out variable reduction. I do not have information
>>>> about the dependent variable, and have only the X variables, as it
>>>> were. In selecting the variables I wish to keep, I have considered the
>>>> following criteria:
>>>> 1) Percentage of missing values in each column/variable
>>>> 2) Variance of each variable, with a cut-off value.
>>>>
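The two screening criteria above could be sketched roughly as follows. screen_columns() and its default cut-offs are hypothetical, chosen only for illustration, and do not come from any package. A raw variance cut-off depends on each variable's units, so the threshold would need to be set per data set.

```r
# Hypothetical helper (not from any package): flag columns to keep based on
# the share of missing values and, for numeric columns, the variance.
screen_columns <- function(df, max_missing = 0.2, min_var = 0.01) {
  frac_missing <- vapply(df, function(x) mean(is.na(x)), numeric(1))
  is_num <- vapply(df, is.numeric, logical(1))
  col_var <- vapply(df, function(x) {
    if (is.numeric(x)) var(x, na.rm = TRUE) else NA_real_
  }, numeric(1))
  # Non-numeric columns are judged on missingness only.
  low_var <- is_num & (is.na(col_var) | col_var < min_var)
  keep <- frac_missing <= max_missing & !low_var
  data.frame(variable = names(df),
             frac_missing = unname(frac_missing),
             variance = unname(col_var),
             keep = unname(keep))
}

# Toy data, invented for the example.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(1, NA, NA, NA, 5),                 # mostly missing
  c = c(2, 2, 2, 2, 2),                    # constant
  d = factor(c("x", "y", "x", "y", "x"))   # categorical
)
res <- screen_columns(toy)
print(res)

# Columns that pass both screens.
reduced <- toy[, res$keep, drop = FALSE]
```

Categorical columns pass through on missingness alone here; a count of levels or of the rarest level would be one possible extra screen for them.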
>>>> I recently came across Weka and found that there is an RWeka package
>>>> which would allow me to make use of Weka through R.
>>>> Weka provides a "Genetic search" variable reduction method, but I
>>>> could not find its R code implementation in the RWeka PDF file on CRAN.
>>>>
>>>> I looked for other R packages that allow me to do variable reduction
>>>> without considering a dependent variable. I came across 'dprep'
>>>> package but it does not have a Windows implementation.
>>>>
>>>> Moreover, I have a dataset that contains continuous and categorical
>>>> variables, some categorical variables having 3 levels, 10 levels and
>>>> so on, up to a maximum of 50 levels (e.g. states in the USA).
>>>>
>>>> Any suggestions in this regard will be much appreciated.
>>>>
>>>> Thank you
>>>>
>>>> Harsh Singhal
>>>> Decision Systems,
>>>> Mu Sigma, Inc.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>
>
>
> --
> Frank E Harrell Jr Professor and Chair School of Medicine
> Department of Biostatistics Vanderbilt University
>
>
>
--
View this message in context: http://www.nabble.com/Pre-model-Variable-Reduction-tp20912229p20919501.html
Sent from the R help mailing list archive at Nabble.com.