[R] sparse PCA using nsprcomp package

Thu Sep 5 23:43:27 CEST 2013

Hi John

I am currently traveling and have sporadic net access, I therefore can only answer briefly. It's also quite late, I hope what follows still makes sense...

> For regular PCA by prcomp(), we can easily calculate the percent of total variance explained by the first k PCs by using cumsum(obj$sdev^2) because these PCs are independent of each other so you can simply add up the variance of these PCs. For sparse PCA, as far as I understand, the generated PCs are not independent of each other anymore, so you can not simply add up variances to calculate percentage of variance explained by the first k PCs. For example, in the package of elasticnet where spca() also performs sparse PCA, one of the output from spca() is "pev" for percent explained variation which is based on so-called "adjusted" variance that adjusted for the fact that these variances of PCs are not independent anymore.

You are correct that measuring explained variance is more involved with sparse (or non-negative) PCA, because the principal axes no longer correspond to eigenvectors of the covariance matrix, and are usually not even orthogonal.

The next update for the 'nsprcomp' package is almost done, and one of the changes will concern the reported standard deviations. In the current version (0.3), the standard deviations are computed from the scores matrix X*W, where X is the data matrix and W is the (pseudo-)rotation matrix consisting of the sparse loadings. Computing variance this way has the advantage that 'sdev' is consistent with the scores matrix, but it has the disadvantage that some of the explained variance is counted more than once because of the non-orthogonality of the principal axes. One of the symptoms of this counting is that the variance of a later component can actually exceed the variance of an earlier component, which is not possible in regular PCA.

In the new version of the package, 'sdev' will report the _additional_ standard deviation of each component, i.e. the variance not explained by the previous components. Given a basis of the space spanned by the previous PAs, the variance of the PC is computed after projecting the current PA to the ortho-complement space of the basis. This procedure reverts back to standard PCA if no sparsity or non-negativity constraints are enforced on the PAs.

> My question is for nsprcomp, how can I calculate percent explained variation by using "sdev" when I know these PCs are not independent of each other?

The new version of the package will do it for you. Until then, you can use something like the following function

asdev <- function(X, W) {
    nc <- ncol(W)
    sdev <- numeric(nc)
    Q <- qr.Q(qr(W))
    Xp <- X
    for (cc in seq_len(nc)) {
        sdev[cc] <- sd(Xp%*%W[ , cc])
        Xp <- Xp - Xp%*%Q[ , cc]%*%t(Q[ , cc])   
    }
    return(sdev)
}

to compute the additional variances for given X and W.

The package documentation will explain the above in some more detail, and I will also have a small blog post which compares the 'nsprcomp' and 'spca' routine from the 'elasticnet' package on the 'marty' data from the EMA package.

Best regards
Christian