prcomp & princomp - revised

Wed, 26 Aug 1998 14:33:11 -0400

My previous post about prcomp and princomp was done in some haste as I had long
ago indicated to Kurt that I would try to have this ready for the June release,
and it appeared that I would miss yet another release. I also need to get it out
before it becomes hopelessly buried by other work.

Brian Ripley kindly pointed out some errors, and also pointed out that I was
suggesting replacing some functions which were already working well in R. Other
than changing prcomp and princomp this was unintentional on my part. It seems
that some of these functions have either appeared since I first started this, or
that I was just careless (hard to imagine but it happens) and perhaps checked
only for prcomp methods and not princomp methods. In any event, I decided it was
necessary to check and indicate the changes I am proposing more carefully,
despite my time pressures, and also, first of all, to indicate my interest and
intentions more clearly.

I have no special attachment to principal components and there are many people
on this list who could do a better job at this than I can. I want to use PCA for
another problem I am working on, and I noticed several months ago that prcomp
and princomp were returning different results in R and Splus. Unless this is
fixed it puts me in the difficult position of having to choose between Splus and
R, which I would prefer not to do for reasons I have expressed from time to
time. (Another solution is to over-ride the mva library with my own functions,
but this has undesirable consequences when others use my library.) My intention
at this point is to submit these changes to the mva library and then not do any
more work on it. I am not suggesting that these methods are better, only that
they are more consistent with Splus. There is room for improvement (in both
Splus and R) and perhaps someone else would like to work on it. Of all the
changes below, the only ones I consider really important are the documentation
and the new version of prcomp which gives results like Splus.

I have some misgivings about adding yet another principal components function
(prcomponents below). However, I would like prcomp and princomp to return the
same results as Splus, and yet I see that there is need for improvement. I would
like this improvement to happen in a function which has a different name from
the functions in Splus, and perhaps be moved into an expanded "compatibility
library" at some point.

Paul Gilbert

_______

The code below makes the following changes relative to R Version 0.62.3 in
progress-release (August 19, 1998).

- prcomp is changed so that it takes similar arguments and usually reproduces
      Splus results. The criterion used for determining if a singular value
      is zero (following suggestions of Martin Maechler) is slightly different
      and this can occasionally give a different result.
- prcomp.Rd is changed to reflect the changes to prcomp
- summary.prcomp is defined to give a result more like summary.princomp. In
      Splus summary.prcomp does not exist and the default method is applied,
      but this difference seems reasonable?
- screeplot is made generic
- function(...) is changed to function(x, ...) for some plot methods (to be
         consistent with plot).
- print.prcomp is changed to be more like the result from Splus (Splus just
     uses print.default, so this is slightly different)
- plot.prcomp is renamed screeplot.prcomp to be consistent with the way
     this is done for princomp, but I have not worked on this
     function and it does not appear to do scaling or other nice things
     like screeplot.princomp. (Hopefully Brian Ripley will volunteer
     to fix it.)
- plot.prcomp calls - screeplot.prcomp

- princomp is still the function I submitted previously but has a
     few additional comments. It reproduces Splus results when the
     argument is a data matrix, but does not work with a covariance
     argument and the Splus version does.
- princomp.Rd is new.
- print.princomp is defined in princomp.R. This version is a bit more like
    the result from Splus and the function of the same name is removed
    from princomp-add.R.

- programs in the file princomp-add.R could be included in princomp.R but I
    was unsure how to deal with the V&R copyright. Perhaps it should
    be included in each function? I've left princomp-add.R as a
    separate file and done only the minimal number of changes necessary to
    make functions work with the changes in other files. (As far as my
    contributions in the other files are concerned I am happy to assign
    copyright to "The R Development Core Team".)

- a tentative and so far undocumented function prcomponents in the file
    prcomponents.R has been added. This is an incomplete attempt to deal
    with shortcomings of prcomp and princomp, the main problems with those
    functions being that the preferred computational method, svd, is used
    in prcomp, but princomp returns additional useful information. The
    function princomp uses eigen so that it returns results consistent
    with Splus, but does not yet support a covariance as an argument,
    which Splus does (and is the reason eigen is used). My function
    prcomponents also tries to give the user the ability to control
    what is used in the normalization (e.g. N or N-1), as suggested by
    Bill Venables. It does not include use="all.obs" which was in the
    prcomp I am proposing to replace, but it should. The difficulty is
    that this argument is passed to cor or cov, which are not
    called if svd is used.

############## prcomp.R replacement file ##############

screeplot    <- function(x, ...)  {UseMethod("screeplot")}

plot.prcomp   <- function(x, ...) {screeplot(x, ...)}

prcomp <- function(x, retx=TRUE, center=TRUE, scale=FALSE) {
# s <- svd(scale(x, center=center, scale=scale),nu=0)
# above produces warning since scale is both a function and a variable so
 s <- svd(get("scale",envir=.GlobalEnv)
   (x, center=center, scale=scale),nu=0)
# rank <- sum(s$d > 0)
 rank <- sum(s$d > (s$d[1]*sqrt(.Machine$double.eps)))
 if (rank < ncol(x)) s$v <- s$v[,1:rank, drop=F]
 s$d <- s$d/sqrt(max(1,nrow(x) -1))
 if(retx) r <- list(sdev=s$d, rotation=s$v, x=as.matrix(x) %*% s$v)
 else     r <- list(sdev=s$d, rotation=s$v)
 class(r) <- "prcomp"
 r
}

print.prcomp <- function(x) {
 cat("Standard deviations:\n")
 print(x$sdev)
 cat("\nRotation:\n")
 print(x$rotation)
 if (!is.null(x$x))
  {cat("\nRotated variables:\n")
   print(x$x)
  }
 invisible(x)
}

summary.prcomp <- function(obj, digits=3 )
{ vars <- obj$sdev^2
  vars <- vars/sum(vars)
  cat("Importance of components:\n")
  print(rbind("Standard deviation" = obj$sdev,
              "Proportion of Variance" = vars,
              "Cumulative Proportion" = cumsum(vars)))
  invisible(obj)
}

screeplot.prcomp <- function(x, main="Scree Plot", ylab="Variance",
  xlab="Component", ...) {
 plot(x$var, main=main, xlab=xlab, ylab=ylab, ...)
}

############### prcomp.Rd replacement file ###############

\name{prcomp}
\title{Principal Components Analysis}
\usage{
prcomp(x, retx=TRUE, center=TRUE, scale=FALSE)

}
\arguments{
\item{x}{a matrix (or data frame) which provides the data
for the principal components analysis.}
\item{retx}{a logical value indicating whether the rotated variables should
be returned.}
\item{center}{a logical value indicating whether the variables should
be shifted to be zero centered.}
\item{scale}{a logical value indicating whether the variables should
be scaled to have unit variance before the analysis takes place. The default
is F for consistency with S, but in general scaling is advisable.}
}
\description{
This function performs a principal components analysis on
the given data matrix and returns the results as a
\code{prcomp} object. The calculation is done with svd on the data matrix, not
by using eigen on the covariance matrix. This is generally the preferred method
for numerical accuracy.  The print method for the these
objects prints the results in a nice format and the
plot method produces a scree plot. The method \code{biplot} plots two selected
components against one another.
}
\value{
\code{prcomp} returns an list with class \code{"prcomp"}
containing the following components:
\item{sdev}{the standard deviation of the principal components
(i.e. the eigenvalues of the cov matrix - though the calculation is actually
done with the singular values of the data matrix.)}
\item{rotation}{the matrix of variable loadings (i.e. a matrix
whose columns contain the eigenvectors). The function princomp returns this
in the element \code{loadings}.}
\item{x}{if retx is true the value of the rotated data (the data multiplied by
the \code{rotation} matrix) is returned.}
}
\references{
Mardia, K. V., J. T. Kent, J and M. Bibby (1979),
\emph{Multivariate Analysis}, London: Academic Press.

Venerables, W. N. and B. D. Ripley (1994).
\emph{Modern Applied Statistics with S-Plus}, Springer-Verlag.
}

\seealso{
\code{\link{princomp}}, \code{\link{prcomponents}}, \code{\link{cor}},
\code{\link{cov}},
\code{\link{svd}}, \code{\link{eigen}}.
}
\examples{
# the variances of the variables in the
# crimes data vary by orders of magnitude
data(crimes)
prcomp(crimes)
prcomp(crimes,scale=TRUE)
plot(prcomp(crimes))
summary(prcomp(crimes))
}

############## princomp.R replacement file ##############

princomp <- function(x, cor=FALSE, scores=TRUE,
  subset=rep(TRUE, nrow(as.matrix(x)))) {
 # it is tempting to add  use="all.obs" which could be passed to cov or
 # cor but then the calculation of N is complicated.
 z<- as.matrix(x)[subset,, drop=F]
 N <- nrow(z)
 if(cor) cv <- get("cor",envir=.GlobalEnv)(z)
 else    cv <- cov(z)
#  (svd can be used but gives different signs for some vectors)
 edc <- eigen(cv)
 cn <- paste("Comp.", 1:ncol(cv), sep="")
 names(edc$values) <- cn
 dimnames(edc$vectors) <- list(dimnames(x)[[2]], cn)
 scr<- NULL
 if (cor)
   {sdev <- sqrt(edc$values)
    sc <- (apply(z,2,var)*(N-1)/N)^0.5
    if (scores)
        scr<-(scale(z,center=T,scale=T) %*% edc$vectors)*sqrt(N/(N-1))
   }
 else
   {sdev <- sqrt(edc$values*(N-1)/N)
    sc <- rep(1, ncol(z))
    if (scores)
        scr<- (scale(z,center=T,scale=F) %*% edc$vectors)
   }
 names(sc) <- dimnames(x)[[2]]
 edc <-list(sdev=sdev, loadings=edc$vectors,
     center=apply(z,2,mean), scale=sc, n.obs=N, scores=scr)
# The Splus function also return list elements factor.sdev, correlations
# and coef, but these are not documented in the help. coef seems to equal
# load. The Splus function also return list elements call and terms which
# are not supported here.
 class(edc) <- "princomp"
 edc
}

print.princomp <- function(x) {
 cat("Standard deviations:\n")
 print(x$sdev)
 cat(length(x$scale), " variables and ", x$n.obs, "observations.\n")
 cat("Scale:\n")
 print(x$scale)
 invisible(x)
}

############### princomp.Rd new file ###############

\name{princomp}
\title{Principal Components Analysis}
\usage{
princomp(x, cor=FALSE, scores=TRUE, subset=rep(TRUE, nrow(as.matrix(x))))
print.princomp(obj)
summary.princomp(obj)
plot.princomp(obj)
}
\alias{print.princomp}
\alias{plot.princomp}
\arguments{
\item{x}{a matrix (or data frame) which provides the data
for the principal components analysis.}
\item{cor}{a logical value indicating whether the calculation should use the
correlation matrix or the covariance matrix.}
\item{score}{a logical value indicating whether the score on each principal
component should be calculated.}
\item{subset}{a vector used to select rows (observations) of the
data matrix \code{x}.}
}
\description{
This function performs a principal components analysis on
the given data matrix and returns the results as a \code{princomp} object.
The calculation is done using \code{eigen} on the correlation or
covariance matrix, as determined by \code{cor}. This is done for compatibility
with the Splus result (even though alternate forms for \code{x} - e.g. a
covariance matrix - are not supported as they are in Splus). A preferred method
of calculation is to use svd on \code{x}, as is done in \code{prcomp}.

Note that the scaling of results is affected by the setting of \code{cor}.
If \code{cor} is T then the divisor in the calculation of the sdev is N-1,
otherwise it is N. This has the effect that the result is slightly different
depending on whether scaling is done first on the data and cor set to F, or
done automatically in \code{princomp} with cor=T.

The print method for the these
objects prints the results in a nice format and the
plot method produces a scree plot.
}
\value{
\code{princomp} returns an list with class \code{"princomp"}
containing the following components:
\item{var}{the variances of the principal components
(i.e. the eigenvalues)}
\item{load}{the matrix of variable loadings (i.e. a matrix
whose columns contain the eigenvectors).}
\item{scale}{the value of the \code{scale} argument.}
}
\references{
Mardia, K. V., J. T. Kent and J. M. Bibby (1979).
\emph{Multivariate Analysis}, London: Academic Press.

Venerables, W. N. and B. D. Ripley (1994).
\emph{Modern Applied Statistics with S-Plus}, Springer-Verlag.
}
\seealso{
\code{\link{prcomp}}, \code{\link{prcomponents}}, \code{\link{cor}},
\code{\link{cov}}, \code{\link{eigen}}.
}
\examples{
# the variances of the variables in the
# crimes data vary by orders of magnitude
data(crimes)
princomp(crimes)
princomp(crimes,cor=TRUE)
princomp(scale(crimes, scale=T, center=T), cor=F)
plot(princomp(crimes))
biplot(princomp(crimes))
summary(princomp(crimes))
loadings(princomp(crimes))
}

############## princomp-add.R replacement file ##############

# copyright (C) 1998 W. N. Venables and B. D. Ripley
#
predict.princomp <- function(object, newdata,...)
{
  if (missing(newdata)) return(object$scores)
  scale(newdata, object$center, object$scale) %*% object$loadings
}

summary.princomp <-
function(object, loadings = F, cutoff = 0.1, digits=3, ...)
{
  vars <- object$sdev^2
  vars <- vars/sum(vars)
  cat("Importance of components:\n")
  print(rbind("Standard deviation" = object$sdev,
              "Proportion of Variance" = vars,
              "Cumulative Proportion" = cumsum(vars)))
  if(loadings) {
    cat("\nLoadings:\n")
    cx <- format(round(object$loadings, digits=digits))
    cx[abs(object$loadings) < cutoff] <-
      substring("       ", 1, nchar(cx[1,1]))
    print(cx, quote = F, ...)
  }
  invisible(object)
}

plot.princomp <- function(x, ...) {screeplot(x, ...)}

screeplot.prcomp <- screeplot.princomp <-
function(x, npcs=min(10, length(x$sdev)), type=c("barplot", "lines"),
         main = deparse(substitute(x)), ...)
{
  eval(main)
  type <- match.arg(type)
  pcs <- x$sdev^2
  xp <- seq(length=npcs)
  if(type=="barplot") barplot(pcs[xp], names = names(pcs[xp]),
       main = main, ylab = "Variances", ...)
  else {
    plot(xp, pcs[xp], type = "b", axes = F, main = main, xlab="",
            ylab = "Variances", ...)
    axis(2)
    axis(1, at = xp, labels = names(pcs[xp]))
  }
  invisible()
}

loadings <- function(x) x$loadings

############## prcomponents.R  new file ##############

prcomponents <- function(x, center=TRUE, scale=TRUE, N=nrow(x)-1)
   {if (center) center <- apply(x,2,mean)
    else        center <- rep(0, ncol(x))
    if (scale)  scale  <- sqrt(apply(x,2,var))
    else        scale  <- rep(1, ncol(x))
    s <- svd(sweep(sweep(as.matrix(x),2, center),2, scale, FUN="/"))
    # remove anything corresponding to effectively zero singular values.
    rank <- sum(s$d > (s$d[1]*sqrt(.Machine$double.eps)))
    if (rank < ncol(x)) s$v <- s$v[,1:rank, drop=FALSE]
    s$d <- s$d/sqrt(N)

#   r <- list(sdev=s$d, proj=s$v,x=x %*% s$v, center=center, scale=scale)
    r <- list(sdev=s$d, proj=s$v, center=center, scale=scale)
    r
}

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._