[R-sig-eco] Variability explanations for the PCO axes as in Anderson and Willis 2003

Thu Jun 2 09:38:03 CEST 2011

On Thu, 2011-06-02 at 11:09 +1000, Kari Lintulaakso wrote:
> In Anderson and Willis 2003 (Canonical Analysis of Principal
> Coordinates: A Useful Method of Constrained Ordination for Ecology)
> they write:
> 
> "For multivariate analysis, the data were transformed
> to y' = ln(y + 1) to remove large differences in scale
> among the original variables. Then Bray-Curtis dissimilarities
> were calculated between every pair of observations,
> and an unconstrained ordination was done using
> PCO (metric MDS) on the dissimilarity matrix. The
> first two PCO axes explained 20.72% and 12.37% of
> the variability in the original dissimilarity matrix."
> 
> Does anyone know which method produces those 20.72% and 12.37% for the PCO's?
> 
> I have tried something like this:
> 
> library(ecodist)
> x<-bcdist(dune)				                # Bray-Curtis distance, no transformation
> 
> x.pco <- cmdscale(x, k=2, eig = TRUE)	# PCO, PCoA
> x.pco$GOF
> 
> > x.pco$GOF
> [1] 0.5714973 0.5961463
> 
> I suppose these 0.57 and 0.59 are not comparable with the percentages
> in Anderson and Willis?
> 
> So my question is how do I get similar kinds of variability
> explanations for the PCO axes as in A&W2003?
> And secondly how do I get those percentages for NMDS?

I'm not familiar with ecodist so show an exmaple using vegan, but the
concept is the same: PCO is an eigen decomposition of the (transformed)
dissimilarity matrix. As such it returns eigenvalues. The sum of these
eigenvalues is the total variance in the dissimilarity matrix and hence
expressing individual eigenvalues as a proportion of the sum of
eigenvalues gives you the proportion (or %) variance explained by each
axis.

R> require(vegan)
Loading required package: vegan
This is vegan 1.17-10
R> data(dune)
R> bcdune <- vegdist(dune, method = "bray")
R> mod <- capscale(bcdune ~ 1)
R> mod
Call: capscale(formula = bcdune ~ 1)

              Inertia Proportion Rank
Total          4.2990                
Real Total     4.5939     1.0000     
Unconstrained  4.5939     1.0000   14
Imaginary     -0.2949               5
Inertia is squared Bray distance 

Eigenvalues for unconstrained axes:
    MDS1     MDS2     MDS3     MDS4     MDS5     MDS6     MDS7     MDS8 
1.716266 1.022398 0.461464 0.382249 0.279135 0.236631 0.169120 0.096245 
    MDS9    MDS10    MDS11    MDS12    MDS13    MDS14 
0.074492 0.061712 0.054940 0.019174 0.016119 0.004001

So we can extract the eigenvalues directly:

R> eig <- mod$CA$eig

or better in newer versions of vegan using the eigenvals() extractor
function:

R> eig <- eigenvals(mod)

Then we can compute the proportion of variance explained on each axis:

R> eig / sum(eig)
        MDS1         MDS2         MDS3         MDS4         MDS5         MDS6 
0.3735929532 0.2225533017 0.1004504626 0.0832071354 0.0607613786 0.0515092860 
        MDS7         MDS8         MDS9        MDS10        MDS11        MDS12 
0.0368137410 0.0209504326 0.0162151974 0.0134333287 0.0119593153 0.0041738163 
       MDS13        MDS14 
0.0035087418 0.0008709095 

Or the cumulative sum of the proportion of variance explained:

R> cumsum(eig / sum(eig))
     MDS1      MDS2      MDS3      MDS4      MDS5      MDS6      MDS7      MDS8 
0.3735930 0.5961463 0.6965967 0.7798039 0.8405652 0.8920745 0.9288883 0.9498387 
     MDS9     MDS10     MDS11     MDS12     MDS13     MDS14 
0.9660539 0.9794872 0.9914465 0.9956203 0.9991291 1.0000000

Things can be a bit more complex with PCO as some dissimilarity
coefficients are not metric and thus cannot be represented in an
Euclidean space (what PCO is trying to do) so we get negative
eigenvalues. The software can deal with this by adding a value to the
dissimilarities to avoid the negative eigenvalues, but just be aware of
this issue - the negative eigenvalues, if they occur, represent variance
in imaginary dimensions (as we have imaginary numbers and normal numbers
[I forget the correct term, rational numbers?] in mathematics) and
possibly need to be considered.

What vegan does in capscale() is just ignore the negative eigenvalues,
or you can add a constant or sqrt the dissimilarities to try to avoid
the problem. The eagle eyed amongst you will have noticed the Imaginary
Inertia component in the summary at the start of my message.

How would you do the analysis using cmdscale()? Here are the eigenvalues
from cmdscale()

R> cmdscale(bcdune, k=NROW(dune)-1, eig = TRUE)$eig
 [1]  1.716266e+00  1.022398e+00  4.614641e-01  3.822492e-01  2.791345e-01
 [6]  2.366309e-01  1.691204e-01  9.624517e-02  7.449176e-02  6.171200e-02
[11]  5.494046e-02  1.917429e-02  1.611897e-02  4.000912e-03 -1.942890e-16
[16] -2.642513e-02 -4.285699e-02 -5.473417e-02 -7.412306e-02 -9.678567e-02
Warning message:
In cmdscale(bcdune, k = NROW(dune) - 1, eig = TRUE) :
  only 14 of the first 19 eigenvalues are > 0

Notice that I ask for the full representation in n-1 dimensional
Euclidean space. The eigenvalues returned are the same whatever value of
`k` is supplied. Notice also how the eigenvalues have gone negative -
this is because the bray curtis dissimilarity is not a metric
coefficient. We can use argument `add` to solve this problem:

R> mod2 <- cmdscale(bcdune, k=(NROW(dune)-1), eig = TRUE, add = TRUE)
Warning message:
In cmdscale(bcdune, k = (NROW(dune) - 1), eig = TRUE, add = TRUE) :
  only 18 of the first 19 eigenvalues are > 0
R> mod2$eig
 [1]  2.600250e+00  1.604963e+00  8.154912e-01  6.868628e-01  5.690645e-01
 [6]  4.729431e-01  3.716289e-01  2.558978e-01  2.317021e-01  2.088922e-01
[11]  1.925223e-01  1.438914e-01  1.357564e-01  1.311664e-01  7.175410e-02
[16]  5.164569e-02  3.686007e-02  9.213435e-03 -1.905371e-16 -5.325325e-16

The last two eigenvalues are indistinguishable from zero (remember this
is a computer doing binary arithmetic!). The eigenvals() extractor does
the right thing and sets these almost-zero eigenvalues to zero:

R> (eig2 <- eigenvals(mod2))
 [1] 2.6002497 1.6049627 0.8154912 0.6868628 0.5690645 0.4729431 0.3716289
 [8] 0.2558978 0.2317021 0.2088922 0.1925223 0.1438914 0.1357564 0.1311664
[15] 0.0717541 0.0516457 0.0368601 0.0092134 0.0000000 0.0000000

So now we can do as before and compute variance explained per axis or
cumulative over k = 1, ..., n-1 axes.

R> eig2 / sum(eig2)
 [1] 0.30268881 0.18682984 0.09492937 0.07995605 0.06624343 0.05505417
 [7] 0.04326043 0.02978845 0.02697189 0.02431664 0.02241105 0.01675005
[13] 0.01580308 0.01526876 0.00835272 0.00601195 0.00429079 0.00107251
[19] 0.00000000 0.00000000
R> cumsum(eig2 / sum(eig2))
 [1] 0.3026888 0.4895187 0.5844480 0.6644041 0.7306475 0.7857017 0.8289621
 [8] 0.8587505 0.8857224 0.9100391 0.9324501 0.9492002 0.9650033 0.9802720
[15] 0.9886247 0.9946367 0.9989275 1.0000000 1.0000000 1.0000000

Notice that these numbers are slightly different to those from
capscale() because capscale() only considered 14 PCO axes (those with
positive eigenvalues), whereas the above uses all axes.

As for nMDS, there are no axes and as such it is not possible or needed
to provide a measure of variance explained per axis. We can provide a
measure of the lack of fit between the k-dimensional mapping of the
dissimilarities and the original dissimilarities, using the stress
statistic and related metrics.

HTH

G

> I really am a newcomer in these kinds of analysis and any kind of
> guidance will help.
> 
> -Kari
> 

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%