[R] Questions about results from PCAproj for robust principal component analysis
Talbot Katz
topkatz at msn.com
Tue Feb 13 21:33:10 CET 2007
Hi.
I have been looking at the PCAproj function in package pcaPP (R 2.4.1) for
robust principal components, and I'm trying to interpret the results. I
started with a data matrix of dimensions RxC (R is the number of rows /
observations, C the number of columns / variables). PCAproj returns a list
of class princomp, similar to the output of the function princomp. In a
case where I can run princomp, I would get the following, from executing
dmpca = princomp(datamatrix) :
- the vector, sdev, of length C, contains the standard deviations of the
components in
order by descending value; the squares are the eigenvalues of the
covariance matrix
- the matrix, loadings, has dimension CxC, and the columns are the
eigenvectors of the
covariance matrix, in the same order as the sdev vector; the columns are
orthonormal:
sum(dmpca$loadings[,i]*dmpca$loadings[,j]) = 1 if i == j, ~ 0 if i != j
- the vector, center, of length C, contains the means of the variable
columns in the original
data matrix, in the same order as the original columns
- the vector, scale, of length C, contains the scalings applied to each
variable, in the same
order as the original columns
- n.obs contains the number of observations used in the computation; this
number equals
R when there is no missing data
- the matrix, scores, has dimension RxC, and it can be thought of as the
projection of the
eigenvector matrix, loadings, back onto the original data; these columns
of
scores are the principal components. princomp typically removes the mean,
so the formula is:
dmpca$scores = t(t(datamatrix) - dmpca$center)%*%dmpca$loadings
and apply(dmpca$scores,2,mean) returns a length C vector of (effectively)
zeroes; also the principal components (columns of scores) are orthogonal
(but not orthonormal):
sum(dmpca$scores[,i]*dmpca$scores[,j]) ~ 0 if i != j, > 0 if i == j
- call contains the function call, in this case princomp(x = datamatrix)
That is all as it should be.
In my case R < C, which produces singular results for standard PCA, but
robust methods, like PCAproj, are designed to handle this. Also, I had
"de-meaned" the data beforehand, so apply(datamatrix,2,mean) produces a
length C vector of (effectively) zeroes. I ran the following:
dmpcaprj=PCAproj(datamatrix,k=4,CalcMethod="sphere",update=TRUE)
to get the first four robust components. When I look at the princomp object
returned as dmpcaprj, some of the results are just what I expect. For
example,
- dmpcaprj$loadings has dimensions Cx4, as expected, and the first four
eigenvectors of
the (robust) covariance matrix are orthonormal:
sum(dmpcaprj$loadings[,i]*dmpcaprj$loadings[,j]) = 1 if i == j, ~ 0 if i
!= j
- dmpcaprj$sdev contains the square roots of the four corresponding
eigenvalues.
- dmpcaprj$n.obs equals R.
- dmpcaprj$scores has dimensions Rx4, as it should.
HOWEVER, the columns of dmpcaprj$scores are neither de-meaned nor
orthogonal. So,
apply(dmpcaprj$scores,2,mean) is a non-zero vector, and
sum(dmpcaprj$scores[,i]*dmpcaprj$scores[,j]) != 0 if i != j, > 0 if i == j
ALSO,
- dmpcaprj$scale is in this case a vector of all 1's, as expected. But the
length is C, not R.
- dmpcaprj$center is a vector of length C, not R, and the entries are not
equal to either
apply(datamatrix,1,mean) or apply(datamatrix,2,mean); I can't figure out
where they came from.
One interesting thing is that the columns of the Rx4 matrix,
dmpcaprj$scores - datamatrix%*%dmpcaprj$loadings
are all identically constant vectors, such that each row equals
apply(dmpcaprj$scores,2,mean), since
apply(datamatrix%*%dmpcaprj$loadings,2,mean) is a length four vector of
(effectively) zeroes,
but I can't interpret the values of these means of dmpcaprj$scores.
Can anyone please explain to me what is happening with the scores, scale,
and center parts of the PCAproj results?
Thanks!
-- TMK --
212-460-5430 home
917-656-5351 cell
More information about the R-help
mailing list