[R] PCA: eigen/princomp vs. svd/prcomp

Sat Jan 21 00:02:10 CET 2006

I am using R 2.2.1 on OS X 10.4.4. I have a question that is partly  
about R but also about some differences in the loadings when doing  
principal components using eigen()/princomp() versus prcomp(). Here
is the story:

I have a matrix of mean monthly temperatures for 26 sites in the  
northern and southern hemispheres (26 x 12). I am using PCA to reduce  
this to one or two variables that capture most of the annual  
temperature variation among these sites. I am particularly interested  
in a single vector that captures the overall annual differences among  
sites. The southern hemisphere sites are 6 months out of phase with  
the northern, in terms of seasons. So the first question is whether  
or not to rotate the southern hemisphere data so that Jan=July,  
Feb=Aug, etc. before PCA. The second question is whether or not to  
center and scale the data. My gut feeling is, no, as these are all  
temperatures and the differences in means and variances among months  
are important.

If I do PCA using eigen()/princomp() on the unrotated, unscaled, and  
uncentered data, the first PC explains about 60% of the variation and  
represents the difference in phase between the southern and northern  
hemispheres. The second PC represents mean temperature and explains  
about 35% of the variation.

If I use prcomp() on the unrotated, unscaled, and uncentered data,  
the first PC represents mean temperature and explains >90% of the  
variation, the second represents the seasonal phase difference and  
explains less than 5% of the variation. This surprised me, as  
intuitively I had expected the seasonal phase difference to fall out  
first, as it did using eigen(). If anyone has an explanation for  
this, I would love to hear it.
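One thing that may be worth checking here is what each function does about centering. The sketch below is only an illustration on made-up data (the `temps` matrix is a hypothetical stand-in, not the real 26 x 12 data), and it assumes eigen() was applied to cov() of the data, which the description above does not actually state. cov() and princomp() subtract the column means internally, whereas prcomp(center = FALSE) decomposes the raw matrix, so "uncentered" may mean different things in the two runs.

```r
set.seed(1)

# Hypothetical stand-in for the 26 x 12 temperature matrix (NOT the real data):
# a per-site mean plus a seasonal cycle whose sign is flipped for "southern" sites.
n_sites    <- 26
month      <- 1:12
site_mean  <- runif(n_sites, 5, 25)
hemisphere <- rep(c(1, -1), each = n_sites / 2)
seasonal   <- cos(2 * pi * (month - 7) / 12)
temps <- site_mean + 8 * outer(hemisphere, seasonal) +
  matrix(rnorm(n_sites * 12), n_sites, 12)

# Route 1: eigen() on the covariance matrix. cov() removes column means itself.
ev_cov <- eigen(cov(temps), symmetric = TRUE)$values

# Route 2: princomp() also works from a (centered) covariance, with divisor n.
pc_princomp <- princomp(temps)

# Route 3: prcomp() with and without centering.
pc_centered <- prcomp(temps, center = TRUE,  scale. = FALSE)
pc_raw      <- prcomp(temps, center = FALSE, scale. = FALSE)

# The uncentered prcomp() corresponds to the raw cross-product matrix instead.
ev_raw <- eigen(crossprod(temps) / (n_sites - 1), symmetric = TRUE)$values

# Which routes agree with which?
print(all.equal(ev_cov, pc_centered$sdev^2))
print(all.equal(ev_cov * (n_sites - 1) / n_sites, unname(pc_princomp$sdev^2)))
print(all.equal(ev_raw, pc_raw$sdev^2))

# Share of total "variance" carried by the first component in each case.
prop_centered <- ev_cov / sum(ev_cov)
prop_raw      <- pc_raw$sdev^2 / sum(pc_raw$sdev^2)
print(round(c(centered = prop_centered[1], raw = prop_raw[1]), 2))
```

If the real analysis followed the same routes, the eigen()/princomp() results would already be centered ones, and only prcomp(center = FALSE) would include the overall mean level in its first component.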

If I center the data, the two methods yield nearly identical results,  
with the first PC capturing the seasonal phase difference and the  
second the mean, explaining 60% and 30% of the variance,
respectively. My intuition (which often is wrong...) says that this  
is not the right way to do things in this case.

I love the result from prcomp() using the uncentered, unscaled data,  
but the loadings are so different from the eigenvectors. I am  
suspicious that something funky is going on here. Does leaving the data
uncentered cause a problem with the math? I would appreciate any comments.

If I rotate the southern hemisphere data six months out of phase,  
then the first PC by either method represents mean temperature and  
the second captures the seasonal difference but again separates the  
northern and southern hemispheres. The variance explained by the  
first PC is about 75% using eigen() and 97% using prcomp(). On one
hand, this seems like a sensible approach; on the other, it is pretty
manipulative of the data. March in Santiago probably is NOT the same
as September in San Francisco, as is reflected in the second PC. But  
again the two methods yield very different amounts of variance  
explained. Why?
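For concreteness, the six-month rotation described above can be sketched as a column shift applied only to the southern rows. The matrix, site names, and `southern` indicator below are hypothetical placeholders, not the real data.

```r
# Hypothetical 4 x 12 example: two northern and two southern sites.
temps <- matrix(1:48, nrow = 4, byrow = TRUE,
                dimnames = list(c("N1", "N2", "S1", "S2"), month.abb))
southern <- c(FALSE, FALSE, TRUE, TRUE)

# Shift southern rows by six months so that the "Jan" column holds the
# southern July value, "Feb" holds August, and so on.
shifted <- temps
shifted[southern, ] <- temps[southern, c(7:12, 1:6)]

print(shifted)
```

The northern rows are left untouched, and the column labels then refer to season-equivalent months rather than calendar months for the southern sites.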

Any thoughts would be very much appreciated!

cheers, George

..................................................................
George W. Gilchrist                        Email #1: gwgilc at wm.edu
Department of Biology, Box 8795          Email #2: kitesci at cox.net
College of William & Mary                    Phone: (757) 221-7751
Williamsburg, VA 23187-8795                    Fax: (757) 221-6483
http://gwgilc.people.wm.edu/