[R] Questions about results from PCAproj for robust principal component analysis

Talbot Katz topkatz at msn.com
Wed Feb 14 21:22:55 CET 2007


Professor Filzmoser.

Thank you so much for the detailed response.  It is very helpful.

--  TMK  --
212-460-5430	home
917-656-5351	cell


>From: Peter Filzmoser <P.Filzmoser at tuwien.ac.at>
>To: Talbot Katz <topkatz at msn.com>
>CC: r-help at stat.math.ethz.ch
>Subject: Re: Questions about results from PCAproj for robust principal 
>component analysis
>Date: Wed, 14 Feb 2007 20:43:13 +0100
>
>Hi,
>
>PCAproj is mainly designed for robust PCA and not for classical PCA.
>Therefore, when applying classical estimators to the results of a
>robust PCA, like the mean to the robust PCA scores, this will usually
>not give zeros. The robust PCs have been centred robustly, and
>not classically by the mean.
>In your case you ran PCAproj with the default method="mad", thus
>robust PCA was performed, maximizing the mad instead of the usual
>variance (standard deviation). Moreover, you used the default for
>centring the PCs, which is center="l1median", so a robust centring
>rather than centring by the classical mean (which would have been
>the zero vector because your data were already classically centred).
>Run the procedure with the options
>center=mean and method="sd"
>which then gives classical PCA. The mean of the resulting PCA
>scores will be 0. However, the scores will not be orthogonal
>(or uncorrelated), but close to. The reason is that PCAproj
>is an axxproximative algorithm for finding the (robust) PCs.
>The eigen-analysis of princomp gives the exact solution (but
>it is not robust).
>
>The wrong length of result$scale and result$center is definitely
>an error in the procedure that I will have to change quickly.
>For data sets with more columns than rows we automatically
>apply a singular value decomposition (SVD) to reduce the
>dimensionality (without information loss). Then we perform
>centring and scaling in the space of reduced dimension, but
>we did not back-transform these values - sorry. Will be repaired
>soon.
>
>Best regards,
>Peter
>
>
>Talbot Katz wrote:
>>Hi.
>>
>>I have been looking at the PCAproj function in package pcaPP (R 2.4.1) for 
>>robust principal components, and I'm trying to interpret the results.  I 
>>started with a data matrix of dimensions RxC (R is the number of rows / 
>>observations, C the number of columns / variables).  PCAproj returns a 
>>list of class princomp, similar to the output of the function princomp.  
>>In a case where I can run princomp, I would get the following, from 
>>executing  dmpca = princomp(datamatrix) :
>>-    the vector, sdev, of length C, contains the standard deviations of 
>>the components in
>>         order by descending value; the squares are the eigenvalues of the
>>         covariance matrix
>>-    the matrix, loadings, has dimension CxC, and the columns are the 
>>eigenvectors of the
>>         covariance matrix, in the same order as the sdev vector; the 
>>columns are
>>         orthonormal:
>>         sum(dmpca$loadings[,i]*dmpca$loadings[,j]) = 1 if i == j, ~ 0 if 
>>i != j
>>-    the vector, center, of length C, contains the means of the variable 
>>columns in the original
>>         data matrix, in the same order as the original columns
>>-    the vector, scale, of length C, contains the scalings applied to each 
>>variable, in the same
>>         order as the original columns
>>-    n.obs contains the number of observations used in the computation; 
>>this number equals
>>         R when there is no missing data
>>-    the matrix, scores, has dimension RxC, and it can be thought of as 
>>the projection of the
>>         eigenvector matrix, loadings, back onto the original data; these 
>>columns of
>>         scores are the principal components.  princomp typically removes 
>>the mean,
>>         so the formula is:
>>         dmpca$scores = t(t(datamatrix) - dmpca$center)%*%dmpca$loadings
>>         and apply(dmpca$scores,2,mean) returns a length C vector of 
>>(effectively)
>>         zeroes; also the principal components (columns of scores) are 
>>orthogonal
>>         (but not orthonormal):
>>         sum(dmpca$scores[,i]*dmpca$scores[,j]) ~ 0 if i != j, > 0 if i == 
>>j
>>-    call contains the function call, in this case princomp(x = 
>>datamatrix)
>>
>>That is all as it should be.
>>
>>
>>In my case R < C, which produces singular results for standard PCA, but 
>>robust methods, like PCAproj, are designed to handle this.  Also, I had 
>>"de-meaned" the data beforehand, so apply(datamatrix,2,mean) produces a 
>>length C vector of (effectively) zeroes.  I ran the following:
>>dmpcaprj=PCAproj(datamatrix,k=4,CalcMethod="sphere",update=TRUE)
>>to get the first four robust components.  When I look at the princomp 
>>object returned as dmpcaprj, some of the results are just what I expect.  
>>For example,
>>-    dmpcaprj$loadings has dimensions Cx4, as expected, and the first four 
>>eigenvectors of
>>         the (robust) covariance matrix are orthonormal:
>>         sum(dmpcaprj$loadings[,i]*dmpcaprj$loadings[,j]) = 1 if i == j, ~ 
>>0 if i != j
>>-    dmpcaprj$sdev contains the square roots of the four corresponding 
>>eigenvalues.
>>-    dmpcaprj$n.obs equals R.
>>-    dmpcaprj$scores has dimensions Rx4, as it should.
>>
>>HOWEVER, the columns of dmpcaprj$scores are neither de-meaned nor 
>>orthogonal.  So,
>>         apply(dmpcaprj$scores,2,mean) is a non-zero vector, and
>>         sum(dmpcaprj$scores[,i]*dmpcaprj$scores[,j]) != 0 if i != j, > 0 
>>if i == j
>>ALSO,
>>-    dmpcaprj$scale is in this case a vector of all 1's, as expected.  But 
>>the length is C, not R.
>>-    dmpcaprj$center is a vector of length C, not R, and the entries are 
>>not equal to either
>>         apply(datamatrix,1,mean)  or  apply(datamatrix,2,mean); I can't 
>>figure out
>>         where they came from.
>>
>>One interesting thing is that the columns of the Rx4 matrix,
>>         dmpcaprj$scores - datamatrix%*%dmpcaprj$loadings
>>are all identically constant vectors, such that each row equals 
>>apply(dmpcaprj$scores,2,mean), since
>>apply(datamatrix%*%dmpcaprj$loadings,2,mean) is a length four vector of 
>>(effectively) zeroes,
>>but I can't interpret the values of these means of dmpcaprj$scores.
>>
>>
>>Can anyone please explain to me what is happening with the scores, scale, 
>>and center parts of the PCAproj results?
>>
>>
>>Thanks!
>>
>>
>>--  TMK  --
>>212-460-5430    home
>>917-656-5351    cell
>>
>>
>>
>>
>
>
>--
>-------------------------------------------------------
>From: Prof. Dr. Peter Filzmoser
>       Dept. of Statistics & Probability Theory
>       Vienna University of Technology
>       Wiedner Hauptstrasse 8-10
>       A-1040 Vienna, Austria
>       Tel. +43 1 58801/10733
>       Fax. +43 1 58801/10799
>       E-mail: P.Filzmoser at tuwien.ac.at
>       Internet:
>       http://www.statistik.tuwien.ac.at/public/filz/
>-------------------------------------------------------
>
>
>



More information about the R-help mailing list