[R] PCA: eigen/princomp vs. svd/prcomp
George W. Gilchrist
gwgilc at wm.edu
Fri Jan 20 15:35:01 CET 2006
I am using R 2.2.1 on OS X 10.4.4. I have a question that is partly
about R but also about some differences in the loadings when doing
principal components using eigen()/princomp() versus prcomp(). Here
is the story:
I have a matrix of mean monthly temperatures for 26 sites in the
northern and southern hemispheres (26 x 12). I am using PCA to reduce
this to one or two variables that capture most of the annual
temperature variation among these sites. I am particularly interested
in a single vector that captures the overall annual differences among
sites. The southern hemisphere sites are 6 months out of phase with
the northern, in terms of seasons. So the first question is whether
or not to rotate the southern hemisphere data so that Jan=July,
Feb=Aug, etc. before PCA. The second question is whether or not to
center and scale the data. My gut feeling is, no, as these are all
temperatures and the differences in means and variances among months
are important.
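
To make the two choices concrete, here is a minimal sketch in R. The data are simulated (the real 26 x 12 matrix is not shown in this post), so `temps`, `south`, `shift6`, the 13/13 split between hemispheres, and the temperature model are all hypothetical stand-ins. It shows one way to rotate the southern rows by six months, and the prcomp() arguments that control centering and scaling.

```r
# Hypothetical data: 26 sites x 12 months of mean monthly temperature.
set.seed(1)
n_sites   <- 26
south     <- rep(c(FALSE, TRUE), each = 13)   # assumed: 13 sites per hemisphere
site_mean <- runif(n_sites, 5, 25)            # assumed annual mean per site
phase     <- ifelse(south, pi, 0)             # southern sites 6 months out of phase
temps <- t(sapply(seq_len(n_sites), function(i)
  site_mean[i] - 8 * cos(2 * pi * (0:11) / 12 + phase[i]) + rnorm(12, sd = 0.5)))
colnames(temps) <- month.abb

# Rotate the southern rows so that column 1 holds their July value,
# column 2 their August value, and so on.
shift6  <- c(7:12, 1:6)
rotated <- temps
rotated[south, ] <- temps[south, shift6]

# The centering/scaling choices are explicit arguments to prcomp().
pc_raw      <- prcomp(temps, center = FALSE, scale. = FALSE)
pc_centered <- prcomp(temps, center = TRUE,  scale. = FALSE)
pc_scaled   <- prcomp(temps, center = TRUE,  scale. = TRUE)
```

The same three calls can be run on `rotated` to compare the rotated and unrotated analyses side by side.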
If I do PCA using eigen()/princomp() on the unrotated, unscaled, and
uncentered data, the first PC explains about 60% of the variation and
represents the difference in phase between the southern and northern
hemispheres. The second PC represents mean temperature and explains
about 35% of the variation.
If I use prcomp() on the unrotated, unscaled, and uncentered data,
the first PC represents mean temperature and explains >90% of the
variation, the second represents the seasonal phase difference and
explains less than 5% of the variation. This surprised me, as
intuitively I had expected the seasonal phase difference to fall out
first, as it did using eigen(). If anyone has an explanation for
this, I would love to hear it.
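
One possible reading, offered as an assumption rather than something the post confirms: if the eigen() analysis was run on cov(X), the data were centered implicitly, because cov() subtracts the column means (and princomp() always centers). prcomp() with center = FALSE instead works on the raw matrix, which corresponds to the eigen decomposition of the uncentered cross-product matrix. The sketch below uses a simulated matrix `X` as a hypothetical stand-in for the temperature data.

```r
set.seed(42)
# Hypothetical stand-in for the 26 x 12 matrix, with large column means
X <- matrix(rnorm(26 * 12, mean = 15, sd = 3), nrow = 26, ncol = 12)
n <- nrow(X)

ev_cov   <- eigen(cov(X))$values                   # cov() subtracts column means
ev_cross <- eigen(crossprod(X) / (n - 1))$values   # raw cross-products, no centering
pc_unc   <- prcomp(X, center = FALSE)              # SVD of the raw matrix

# prcomp(center = FALSE) matches the uncentered cross-product matrix ...
same_as_cross <- isTRUE(all.equal(pc_unc$sdev^2, ev_cross))
# ... and does not match eigen(cov(X))
same_as_cov   <- isTRUE(all.equal(pc_unc$sdev^2, ev_cov))

# Share of the first component under each analysis
share_cov <- ev_cov[1] / sum(ev_cov)
share_unc <- pc_unc$sdev[1]^2 / sum(pc_unc$sdev^2)
```

Under this reading the two "uncentered" analyses are not decomposing the same matrix, which would be enough to produce different loadings and different percentages.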
If I center the data, the two methods yield nearly identical results,
with the first PC capturing the seasonal phase difference and the
second the mean, explaining 60% and 30% of the variance,
respectively. My intuition (which is often wrong...) says that this
is not the right way to do things in this case.
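
The agreement after centering can be checked directly. This is a sketch on simulated data (`X` is a hypothetical stand-in for the temperature matrix): prcomp() with its default centering reproduces the eigenvalues of cov(X), while princomp() uses divisor n instead of n - 1, so its variances differ by the factor (n - 1)/n but the proportions explained are the same.

```r
set.seed(7)
X <- matrix(rnorm(26 * 12, mean = 15, sd = 3), nrow = 26)  # hypothetical data
n <- nrow(X)

e  <- eigen(cov(X))
p1 <- prcomp(X)      # centers by default, divisor n - 1
p2 <- princomp(X)    # always centers, divisor n

# Variances: prcomp() equals eigen(cov(X)); princomp() is rescaled by (n - 1)/n
vars_match_prcomp   <- isTRUE(all.equal(p1$sdev^2, e$values))
vars_match_princomp <- isTRUE(all.equal(unname(p2$sdev^2), e$values * (n - 1) / n))

# Loadings agree with the eigenvectors up to the sign of each column
loadings_match <- isTRUE(all.equal(abs(unname(p1$rotation)), abs(e$vectors),
                                   tolerance = 1e-6))
```

The sign of each component is arbitrary, so a flipped column in the loadings is not a real disagreement between the methods.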
I love the result from prcomp() using the uncentered, unscaled data,
but the loadings are so different from the eigenvectors. I am
suspicious that something funky is going on here. Does leaving the data
uncentered cause a problem with the math? I would appreciate any comments.
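
On the arithmetic of leaving the data uncentered: the total "variance" reported by prcomp(center = FALSE) is a sum of squares about zero, so it splits into the usual variance about the column means plus a term that comes purely from the means themselves. The sketch below checks that identity on simulated data; `X` and the chosen mean and spread are hypothetical.

```r
set.seed(3)
X <- matrix(rnorm(26 * 12, mean = 15, sd = 3), nrow = 26)  # hypothetical data
n <- nrow(X)

total_unc  <- sum(prcomp(X, center = FALSE)$sdev^2)  # sum(X^2) / (n - 1)
about_mean <- sum(diag(cov(X)))                      # variance about column means
mean_part  <- n / (n - 1) * sum(colMeans(X)^2)       # contribution of the means

# Identity: uncentered total = centered total + mean contribution
identity_holds <- isTRUE(all.equal(total_unc, about_mean + mean_part))

# Fraction of the uncentered total that is due to the means alone
mean_fraction <- mean_part / total_unc
```

When the column means are large relative to the spread, as one would expect for temperatures well away from zero, the mean term dominates the uncentered total, and a first component aligned with mean temperature can absorb most of it.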
If I rotate the southern hemisphere data six months out of phase,
then the first PC by either method represents mean temperature and
the second captures the seasonal difference but again separates the
northern and southern hemispheres. The variance explained by the
first PC is about 75% using eigen() and 97% using prcomp(). On the one
hand, this seems like a sensible approach; on the other, it is pretty
manipulative of the data. March in Santiago probably is NOT the same
as September in San Francisco, as is reflected in the second PC. But
again the two methods yield very different amounts of variance
explained. Why?
Any thoughts would be very much appreciated!
cheers, George
..................................................................
George W. Gilchrist Email #1: gwgilc at wm.edu
Department of Biology, Box 8795 Email #2: kitesci at cox.net
College of William & Mary Phone: (757) 221-7751
Williamsburg, VA 23187-8795 Fax: (757) 221-6483
http://gwgilc.people.wm.edu/