[R-sig-eco] Variability explanations for the PCO axes as in Anderson and Willis 2003
Gavin Simpson
gavin.simpson at ucl.ac.uk
Thu Jun 2 09:38:03 CEST 2011
On Thu, 2011-06-02 at 11:09 +1000, Kari Lintulaakso wrote:
> In Anderson and Willis 2003 (Canonical Analysis of Principal
> Coordinates: A Useful Method of Constrained Ordination for Ecology)
> they write:
>
> "For multivariate analysis, the data were transformed
> to y' = ln(y + 1) to remove large differences in scale
> among the original variables. Then Bray-Curtis dissimilarities
> were calculated between every pair of observations,
> and an unconstrained ordination was done using
> PCO (metric MDS) on the dissimilarity matrix. The
> first two PCO axes explained 20.72% and 12.37% of
> the variability in the original dissimilarity matrix."
>
> Does anyone know which method produces those 20.72% and 12.37% for the PCO's?
>
> I have tried something like this:
>
> library(ecodist)
> x<-bcdist(dune) # Bray-Curtis distance, no transformation
>
> x.pco <- cmdscale(x, k=2, eig = TRUE) # PCO, PCoA
> x.pco$GOF
>
> > x.pco$GOF
> [1] 0.5714973 0.5961463
>
> I suppose these 0.57 and 0.59 are not comparable with the percentages
> in Anderson and Willis?
>
> So my question is how do I get similar kinds of variability
> explanations for the PCO axes as in A&W2003?
> And secondly how do I get those percentages for NMDS?
I'm not familiar with ecodist so show an exmaple using vegan, but the
concept is the same: PCO is an eigen decomposition of the (transformed)
dissimilarity matrix. As such it returns eigenvalues. The sum of these
eigenvalues is the total variance in the dissimilarity matrix and hence
expressing individual eigenvalues as a proportion of the sum of
eigenvalues gives you the proportion (or %) variance explained by each
axis.
R> require(vegan)
Loading required package: vegan
This is vegan 1.17-10
R> data(dune)
R> bcdune <- vegdist(dune, method = "bray")
R> mod <- capscale(bcdune ~ 1)
R> mod
Call: capscale(formula = bcdune ~ 1)
Inertia Proportion Rank
Total 4.2990
Real Total 4.5939 1.0000
Unconstrained 4.5939 1.0000 14
Imaginary -0.2949 5
Inertia is squared Bray distance
Eigenvalues for unconstrained axes:
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
1.716266 1.022398 0.461464 0.382249 0.279135 0.236631 0.169120 0.096245
MDS9 MDS10 MDS11 MDS12 MDS13 MDS14
0.074492 0.061712 0.054940 0.019174 0.016119 0.004001
So we can extract the eigenvalues directly:
R> eig <- mod$CA$eig
or better in newer versions of vegan using the eigenvals() extractor
function:
R> eig <- eigenvals(mod)
Then we can compute the proportion of variance explained on each axis:
R> eig / sum(eig)
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6
0.3735929532 0.2225533017 0.1004504626 0.0832071354 0.0607613786 0.0515092860
MDS7 MDS8 MDS9 MDS10 MDS11 MDS12
0.0368137410 0.0209504326 0.0162151974 0.0134333287 0.0119593153 0.0041738163
MDS13 MDS14
0.0035087418 0.0008709095
Or the cumulative sum of the proportion of variance explained:
R> cumsum(eig / sum(eig))
MDS1 MDS2 MDS3 MDS4 MDS5 MDS6 MDS7 MDS8
0.3735930 0.5961463 0.6965967 0.7798039 0.8405652 0.8920745 0.9288883 0.9498387
MDS9 MDS10 MDS11 MDS12 MDS13 MDS14
0.9660539 0.9794872 0.9914465 0.9956203 0.9991291 1.0000000
Things can be a bit more complex with PCO as some dissimilarity
coefficients are not metric and thus cannot be represented in an
Euclidean space (what PCO is trying to do) so we get negative
eigenvalues. The software can deal with this by adding a value to the
dissimilarities to avoid the negative eigenvalues, but just be aware of
this issue - the negative eigenvalues, if they occur, represent variance
in imaginary dimensions (as we have imaginary numbers and normal numbers
[I forget the correct term, rational numbers?] in mathematics) and
possibly need to be considered.
What vegan does in capscale() is just ignore the negative eigenvalues,
or you can add a constant or sqrt the dissimilarities to try to avoid
the problem. The eagle eyed amongst you will have noticed the Imaginary
Inertia component in the summary at the start of my message.
How would you do the analysis using cmdscale()? Here are the eigenvalues
from cmdscale()
R> cmdscale(bcdune, k=NROW(dune)-1, eig = TRUE)$eig
[1] 1.716266e+00 1.022398e+00 4.614641e-01 3.822492e-01 2.791345e-01
[6] 2.366309e-01 1.691204e-01 9.624517e-02 7.449176e-02 6.171200e-02
[11] 5.494046e-02 1.917429e-02 1.611897e-02 4.000912e-03 -1.942890e-16
[16] -2.642513e-02 -4.285699e-02 -5.473417e-02 -7.412306e-02 -9.678567e-02
Warning message:
In cmdscale(bcdune, k = NROW(dune) - 1, eig = TRUE) :
only 14 of the first 19 eigenvalues are > 0
Notice that I ask for the full representation in n-1 dimensional
Euclidean space. The eigenvalues returned are the same whatever value of
`k` is supplied. Notice also how the eigenvalues have gone negative -
this is because the bray curtis dissimilarity is not a metric
coefficient. We can use argument `add` to solve this problem:
R> mod2 <- cmdscale(bcdune, k=(NROW(dune)-1), eig = TRUE, add = TRUE)
Warning message:
In cmdscale(bcdune, k = (NROW(dune) - 1), eig = TRUE, add = TRUE) :
only 18 of the first 19 eigenvalues are > 0
R> mod2$eig
[1] 2.600250e+00 1.604963e+00 8.154912e-01 6.868628e-01 5.690645e-01
[6] 4.729431e-01 3.716289e-01 2.558978e-01 2.317021e-01 2.088922e-01
[11] 1.925223e-01 1.438914e-01 1.357564e-01 1.311664e-01 7.175410e-02
[16] 5.164569e-02 3.686007e-02 9.213435e-03 -1.905371e-16 -5.325325e-16
The last two eigenvalues are indistinguishable from zero (remember this
is a computer doing binary arithmetic!). The eigenvals() extractor does
the right thing and sets these almost-zero eigenvalues to zero:
R> (eig2 <- eigenvals(mod2))
[1] 2.6002497 1.6049627 0.8154912 0.6868628 0.5690645 0.4729431 0.3716289
[8] 0.2558978 0.2317021 0.2088922 0.1925223 0.1438914 0.1357564 0.1311664
[15] 0.0717541 0.0516457 0.0368601 0.0092134 0.0000000 0.0000000
So now we can do as before and compute variance explained per axis or
cumulative over k = 1, ..., n-1 axes.
R> eig2 / sum(eig2)
[1] 0.30268881 0.18682984 0.09492937 0.07995605 0.06624343 0.05505417
[7] 0.04326043 0.02978845 0.02697189 0.02431664 0.02241105 0.01675005
[13] 0.01580308 0.01526876 0.00835272 0.00601195 0.00429079 0.00107251
[19] 0.00000000 0.00000000
R> cumsum(eig2 / sum(eig2))
[1] 0.3026888 0.4895187 0.5844480 0.6644041 0.7306475 0.7857017 0.8289621
[8] 0.8587505 0.8857224 0.9100391 0.9324501 0.9492002 0.9650033 0.9802720
[15] 0.9886247 0.9946367 0.9989275 1.0000000 1.0000000 1.0000000
Notice that these numbers are slightly different to those from
capscale() because capscale() only considered 14 PCO axes (those with
positive eigenvalues), whereas the above uses all axes.
As for nMDS, there are no axes and as such it is not possible or needed
to provide a measure of variance explained per axis. We can provide a
measure of the lack of fit between the k-dimensional mapping of the
dissimilarities and the original dissimilarities, using the stress
statistic and related metrics.
HTH
G
> I really am a newcomer in these kinds of analysis and any kind of
> guidance will help.
>
> -Kari
>
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson [t] +44 (0)20 7679 0522
ECRC, UCL Geography, [f] +44 (0)20 7679 0565
Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
More information about the R-sig-ecology
mailing list