[R] Question about PCA with prcomp
Mark Difford
mark_difford at yahoo.co.uk
Mon Jul 2 22:44:29 CEST 2007
Hi James, Ravi:
James wrote:
...
>> I have 20 "entities" for which I have ~500 measurements each. So, I
>> have a matrix of 20 rows by ~500 columns.
...
Perhaps I misread James' question, but I don't think so. As James described
it, we have ~500 measurements made on 20 objects. A PCA on this [20
rows/observations by ~500 columns/descriptors/variables] returns at most
min(n - 1, p) = 19 non-zero eigenvalues (prcomp reports min(n, p) = 20
components, the last with essentially zero variance after centring). And each
of the ~500 columns/descriptors/variables will have a loading on each PC.
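A quick sketch with simulated data (standing in for James's 20 x 500 matrix; the numbers here are illustrative, not his) shows what prcomp actually returns in this situation:

```r
## With n = 20 rows and p = 500 columns, prcomp() reports min(n, p) = 20
## components, but only min(n - 1, p) = 19 have non-zero variance.
set.seed(1)
X  <- matrix(rnorm(20 * 500), nrow = 20, ncol = 500)
pr <- prcomp(X, center = TRUE, scale. = TRUE)

length(pr$sdev)        # 20 components are reported ...
sum(pr$sdev > 1e-8)    # ... but only 19 carry any variance
dim(pr$rotation)       # 500 x 20: each variable loads on each PC
```

The rotation matrix is where James's answer lives: pr$rotation[, 1:2] gives each variable's loading on PC1 and PC2.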
James wants to reduce his descriptors/measurements/variables to the "most
important" ones (those carrying the most variance). A primitive way of doing
this would be to examine the loadings on the first 2--3 PCs, choose the
columns/descriptors/variables with the highest absolute loadings, and throw
away the rest. [He has already decided that he can throw away all but the
first two PCs.] In fact, it would be a very good idea to do a coinertia
analysis on the pre- and post-selection data sets and look at the RV
coefficient. If this is above [thumbsuck] 0.9, then you're doing very well
(there's a good plot method for this in ade4; see ?coinertia &c).
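For concreteness, that primitive loading-based selection might be sketched like this (simulated data again; the variable names and the "top 25" cut-off are arbitrary choices for illustration):

```r
## Keep the variables with the largest absolute loadings on PC1/PC2.
set.seed(1)
X  <- matrix(rnorm(20 * 500), nrow = 20,
             dimnames = list(NULL, paste0("V", 1:500)))
pr <- prcomp(X, center = TRUE, scale. = TRUE)

## loadings of each variable on the first two PCs
load12 <- pr$rotation[, 1:2]

## rank variables by their largest absolute loading on PC1/PC2
## and keep, say, the top 25
score <- apply(abs(load12), 1, max)
keep  <- names(sort(score, decreasing = TRUE))[1:25]
X_sub <- X[, keep]    # reduced data set for downstream modelling
```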
But see Cadima et al. (and the references therein, which urge caution), and
the wider literature, for more sophisticated methods of subsetting.
Regards,
Mark.
Ravi Varadhan wrote:
>
> Mark,
>
> What you are referring to deals with the selection of covariates, since PCA
> doesn't do dimensionality reduction in the sense of covariate selection.
> But what James is asking for is to identify how much each data point
> contributes to individual PCs. I don't think that James's query makes much
> sense, unless he meant to ask: which individuals have high/low scores on
> PC1/PC2. Here are some comments that may be tangentially related to
> James's question:
>
> 1. If one is worried about a few data points contributing heavily to the
> estimation of PCs, then one can use robust PCA, for example, using robust
> covariance matrices. MASS has some tools for this.
> 2. The "biplot" for the first 2 PCs can give some insights.
> 3. PCs, especially the last few PCs, can be used to identify "outliers".
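[Inline sketches of these three points, on simulated data with a few planted outliers; cov.rob() is MASS's robust covariance estimator, and the contaminated rows are purely illustrative:]

```r
library(MASS)
set.seed(1)
Y <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
Y[1:3, ] <- Y[1:3, ] + 10          # plant a few gross outliers

## 1. robust PCA: eigen-decompose a robust (MCD) covariance estimate
rcov <- cov.rob(Y, method = "mcd")
rpca <- eigen(rcov$cov)            # $values ~ eigenvalues, $vectors ~ loadings

## 2. biplot of an ordinary PCA for a quick visual check
pr <- prcomp(Y, scale. = TRUE)
## biplot(pr)                      # observations and variable axes together

## 3. observations that break the correlation structure often stand out
##    on the scores of the last few PCs
last_scores <- pr$x[, ncol(pr$x)]
head(order(abs(last_scores), decreasing = TRUE))
```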
>
> Hope this is helpful,
> Ravi.
>
> -----------------------------------------------------------------------------
>
> Ravi Varadhan, Ph.D.
>
> Assistant Professor, The Center on Aging and Health
>
> Division of Geriatric Medicine and Gerontology
>
> Johns Hopkins University
>
> Ph: (410) 502-2619
>
> Fax: (410) 614-9625
>
> Email: rvaradhan at jhmi.edu
>
> Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
>
>
>
> -----------------------------------------------------------------------------
>
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Mark Difford
> Sent: Monday, July 02, 2007 1:55 PM
> To: r-help at stat.math.ethz.ch
> Subject: Re: [R] Question about PCA with prcomp
>
>
> Hi James,
>
> Have a look at Cadima et al.'s subselect package [Cadima worked with/was a
> student of Prof Jolliffe, one of _the_ experts on PCA; Jolliffe devotes
> part
> of a Chapter to this question in his text (Principal Component Analysis,
> pub. Springer)]. Then you should look at psychometric stuff: a good place
> to start would be Professor Revelle's psych package.
>
> Best,
> Mark.
>
>
> James R. Graham wrote:
>>
>> Hello All,
>>
>> The basic premise of what I want to do is the following:
>>
>> I have 20 "entities" for which I have ~500 measurements each. So, I
>> have a matrix of 20 rows by ~500 columns.
>>
>> The 20 entities fall into two classes: "good" and "bad."
>>
>> I eventually would like to derive a model that would then be able to
>> classify new entities as being in "good territory" or "bad territory"
>> based upon my existing data set.
>>
>> I know that not all ~500 measurements are meaningful, so I thought
>> the best place to begin would be to do a PCA in order to reduce the
>> amount of data with which I have to work.
>>
>> I did this using the prcomp function and found that nearly 90% of the
>> variance in the data is explained by PC1 and 2.
>>
>> So far, so good.
>>
>> I would now like to find out which of the original ~500 measurements
>> contribute to PC1 and 2 and by how much.
>>
>> Any tips would be greatly appreciated! And apologies in advance if
>> this turns out to be an idiotic question.
>>
>>
>> james
>>
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
> --
> View this message in context:
> http://www.nabble.com/Question-about-PCA-with-prcomp-tf4012919.html#a11398608
> Sent from the R help mailing list archive at Nabble.com.
>