[R] Error in princomp?

Jari Oksanen jarioksa at sun3.oulu.fi
Fri Oct 4 08:43:56 CEST 2002


tlumley at u.washington.edu said:
> On Thu, 3 Oct 2002, [iso-8859-1] Peter Vikström wrote:
> Hello! >
> When using princomp() for principal components analysis, the resulting
> loadings matrix differs between R and the output from other programs
> (S-plus
> and SAS)
> The difference is that the sign of some columns in the loadings
> matrix. The
> absolute values, however, are the same.
> This sign difference also exists when comparing the princomp() results
> with
> the manually calculated eigen vectors using eigen().
> Is this a bug in R?

> No. The results are only defined up to a sign change. 

A more intriguing feature in R is that its two alternative PCAs in mva differ 
slightly and report slightly different square roots of eigenvalues:

(here I use my own data set, but the result are similar with any you have:)

> princomp(varespec)$sdev[1:5]
   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5 
30.692367 21.094028 11.257890  8.417428  6.811818 
> prcomp(varespec)$sdev[1:5]
[1] 31.352493 21.547715 11.500023  8.598469  6.958325

The reason is, of course, that `prcomp' opts for unbiased variance whereas 
`princomp' works hard to get biased variances, but of course we can turn that 
off:

>  princomp(varespec)$sdev[1:5]/sqrt(1 - 1/nrow(varespec))
   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5 
31.352493 21.547715 11.500023  8.598469  6.958325 

The practise in prcomp seems to be similar to the normal one, so that its 
eigenvalues add up to variance:

> sum(apply(varespec, 2, var))
[1] 1825.659
> sum(princomp(varespec)$sdev^2)
[1] 1749.590
> sum(prcomp(varespec)$sdev^2)
[1] 1825.659

However, princomp seems to work hard to get biased estimates of covariance:

 cv <- covmat$cov * (1 - 1/n.obs) # in princomp.default
(here covmat$cov is the matrix of unbiased covariances)

whereas it doesn't do the same if analysis is based on correlations:

    if (cor) { # in princomp.default
        sds <- sqrt(diag(cv))
        cv <- cv/(sds %o% sds)
    }

Whereas prcomp is just as explicit in opting for unbiased variances, although it
has to do this after extracting the singular values (s$d) which otherwise would
be related to sums of squares instead of variances:

    s$d <- s$d/sqrt(max(1, nrow(x) - 1))

Personally, I'd opt for the prcomp alternative (which is the official 
recommendation anyhow).

Except for sign, the PC solutions are indeterminate for scale (scores at least,
the loadings are habitually orthonormal == rotation matrix). It seems that R
opts for slight inconsistency, though. Square root of eigenvalues are geared for
variances, but the scores for sums of squares. 

> sum(prcomp(varespec)$sdev^2)
[1] 1825.659
> sum(prcomp(varespec)$x^2) 
[1] 41990.17
> sum(prcomp(varespec)$x^2)/(nrow(varespec)-1)
[1] 1825.659

The current scaling is just as correct as the alternatives, of course, but not 
concordant with the square root of eigenvalues: eigenvalues are divided by df 
after extraction, but scores are not. This can be turned of by dividing the 
scores by sqrt(nrow(x)-1) -- which of course is a constant for any matrix x.

Peter Vikström's message alone didn't make me to study PCA output in R this
closely, but I just confronted a really weird scaling in another software and
trying to understand that one I looked at R as well. I am just implementing Cajo
ter Braak's "Redundancy Analysis" for ecological community data in R (a spin-off
from his "Canonical Correspondence Analysis"), but I couldn't get similar
species scores as in his software Canoco, although my results were similar to
PCAs in R. Canoco works like it would use covariance matrix or svd of centred
data -- it does neither, but something more obscure, but it looks like this from
the surface. However, the species scores (which are relative to eigenvalues:
site scores are orthonormal) are divided by the standard deviation of each
species. So they are the square roots of ``variance explained by axis'' for each
species, or look like coming from correlation matrix although the analysis is
based on covariance matrix. I won't implement that behaviour, but that's 
an interesting alternative.

This message really was a digression...

cheers, jari oksanen 
-- 
Jari Oksanen -- Dept Biology, Univ Oulu, 90014 Oulu, Finland
Ph. +358 8 5531526, cell +358 40 5136529, fax +358 8 5531061
email jari.oksanen at oulu.fi, homepage http://cc.oulu.fi/~jarioksa/


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._



More information about the R-help mailing list