[R] Question about PCA with prcomp

Mark Difford mark_difford at yahoo.co.uk
Mon Jul 2 22:44:29 CEST 2007


Hi James, Ravi:

James wrote:
...
>> I have 20 "entities" for which I have ~500 measurements each. So, I   
>> have a matrix of 20 rows by ~500 columns.
...

Perhaps I misread James' question, but I don't think so.  As James described
it, we have ~500 measurements made on 20 objects.  A PCA on this [20
rows/observations by ~500 columns/descriptors/variables] will return at most
min(20, ~500) = 20 principal components (only 19 of them with non-zero
eigenvalues once the data are centred), but each of the ~500
columns/descriptors/variables will have a loading on each of those PCs.

James wants to reduce his descriptors/measurements/variables to the "most
important" ones, i.e. those carrying most of the variance.  A primitive way
of doing this would be to examine the loadings on the first 2--3 PCs, keep
the columns/descriptors/variables with the largest absolute loadings, and
throw away the rest.  [He has already decided that he can throw away all but
the first two PCs.]  In fact, it would be a very good idea to do a coinertia
analysis on the pre- and post-selection sets and look at the RV coefficient.
If this is above [a rough guess] 0.9, then you are doing very well (there is
a good plot method for this in ade4; see coinertia &c).
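
A minimal sketch of the loadings-based screening just described, assuming the
20 x ~500 data matrix sits in an object called dat (a hypothetical name); the
ade4 coinertia/RV check is only roughed out in comments, so check ade4's
documentation for the exact calls:

## PCA on the 20 x ~500 matrix; scale. = TRUE puts variables on a common footing
pca <- prcomp(dat, center = TRUE, scale. = TRUE)
summary(pca)                        ## proportion of variance explained per PC

## loadings of every variable on PC1 and PC2
load12 <- pca$rotation[, 1:2]

## rank the variables by their largest absolute loading on PC1/PC2
imp  <- apply(abs(load12), 1, max)
keep <- names(sort(imp, decreasing = TRUE))[1:50]   ## e.g. keep the top 50

## rough coinertia check of the full vs. reduced table (ade4), as suggested above
## library(ade4)
## pca.full <- dudi.pca(dat,         scannf = FALSE, nf = 2)
## pca.red  <- dudi.pca(dat[, keep], scannf = FALSE, nf = 2)
## coin     <- coinertia(pca.full, pca.red, scannf = FALSE, nf = 2)
## summary(coin); plot(coin)        ## summary reports the RV coefficient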

But see Cadima et al. (and the references there and elsewhere, which also
urge some caution) for more sophisticated methods of subsetting.

Regards,
Mark.


Ravi Varadhan wrote:
> 
> Mark,
> 
> What you are referring to deals with the selection of covariates; PCA by
> itself doesn't do dimensionality reduction in the sense of covariate
> selection.  But what James is asking for is to identify how much each data
> point contributes to the individual PCs.  I don't think that James's query
> makes much sense, unless he meant to ask: which individuals have high/low
> scores on PC1/PC2.  Here are some comments that may be tangentially
> related to James's question:
> 
> 1.  If one is worried about a few data points contributing heavily to the
> estimation of the PCs, one can use robust PCA, for example via robust
> covariance matrices.  MASS has some tools for this.
> 2.  The "biplot" for the first 2 PCs can give some insight.
> 3.  PCs, especially the last few, can be used to identify "outliers".
>   
> Hope this is helpful,
> Ravi.
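
[A rough sketch of Ravi's points 1 and 2 above, again assuming a hypothetical
data object dat; note that robust covariance estimators such as MASS's
cov.rob need more observations than variables, so on a 20 x ~500 matrix the
robust route only becomes feasible after a first round of variable screening:

## 1. robust PCA via a robust covariance matrix (MASS); needs nrow(dat) > ncol(dat)
## library(MASS)
## rob    <- cov.rob(dat, method = "mcd")   ## robust centre and covariance (MCD)
## pc.rob <- princomp(covmat = rob)         ## princomp accepts such a covariance list
## loadings(pc.rob)

## 2. biplot of the first two PCs from an ordinary prcomp fit
pca <- prcomp(dat, scale. = TRUE)
biplot(pca, choices = 1:2, cex = 0.6)
]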
> 
> ---------------------------------------------------------------------------------
> 
> Ravi Varadhan, Ph.D.
> 
> Assistant Professor, The Center on Aging and Health
> 
> Division of Geriatric Medicine and Gerontology 
> 
> Johns Hopkins University
> 
> Ph: (410) 502-2619
> 
> Fax: (410) 614-9625
> 
> Email: rvaradhan at jhmi.edu
> 
> Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
> 
>  
> 
> ---------------------------------------------------------------------------------
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Mark Difford
> Sent: Monday, July 02, 2007 1:55 PM
> To: r-help at stat.math.ethz.ch
> Subject: Re: [R] Question about PCA with prcomp
> 
> 
> Hi James,
> 
> Have a look at Cadima et al.'s subselect package [Cadima worked with/was a
> student of Prof Jolliffe, one of _the_ experts on PCA; Jolliffe devotes
> part
> of a Chapter to this question in his text (Principal Component Analysis,
> pub. Springer)].  Then you should look at psychometric stuff: a good place
> to start would be Professor Revelle's psych package.
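
[If one follows the psych suggestion, a minimal sketch, again with a
hypothetical data object dat; principal() reports the loadings table
directly, which is essentially what James is after:

## library(psych)
## pc2 <- principal(dat, nfactors = 2, rotate = "none")   ## unrotated = ordinary PCA
## print(pc2$loadings, cutoff = 0.3)   ## hide loadings below 0.3 in absolute value
]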
> 
> BestR,
> Mark.
> 
> 
> James R. Graham wrote:
>> 
>> Hello All,
>> 
>> The basic premise of what I want to do is the following:
>> 
>> I have 20 "entities" for which I have ~500 measurements each. So, I  
>> have a matrix of 20 rows by ~500 columns.
>> 
>> The 20 entities fall into two classes: "good" and "bad."
>> 
>> I eventually would like to derive a model that would then be able to  
>> classify new entities as being in "good territory" or "bad territory"  
>> based upon my existing data set.
>> 
>> I know that not all ~500 measurements are meaningful, so I thought  
>> the best place to begin would be to do a PCA in order to reduce the  
>> amount of data with which I have to work.
>> 
>> I did this using the prcomp function and found that nearly 90% of the  
>> variance in the data is explained by PC1 and 2.
>> 
>> So far, so good.
>> 
>> I would now like to find out which of the original ~500 measurements  
>> contribute to PC1 and 2 and by how much.
>> 
>> Any tips would be greatly appreciated! And apologies in advance if  
>> this turns out to be an idiotic question.
>> 
>> 
>> james
>> 
> 
> 



