[R-sig-eco] Comparison and interpretation of results from envfit, ordisurf and bioenv in the vegan package
Jeff
jrg123 at mail.com
Tue Jan 6 08:26:01 CET 2015
Dear forum,
My first time posting to this forum -- afraid I am going to start with a
somewhat conceptual/complicated question.
My question concerns the analysis of environmental factors as predictors
of community (using vegan). Specifically I am trying to understand why
envit and ordisurf appear to be giving me a qualitatively different
picture of environmental correlations with my community matrix when
compared with the bioenv function/approach.
Briefly:
I am using NMDS to reduce dimensionality of a large dataset that of soil
microbes across broad geographic scale (there are over 250 sites and 60
species in the matrix). I got the metaMDS in vegan to converge on a
solution despite the high number of site pairs that shared no species
(~30%) by using the noshare=0.1 modifier.
Call:
metaMDS(comm = decostand(comm.matrix, "pa"), distance = "jaccard", k =
3, trymax = 150, engine = "monoMDS", noshare = 0.1, weakties = TRUE,
stress = 1, maxit = 500, scaling = TRUE, pc = TRUE, smin = 1e-04,
sfgrmin = 1e-07, sratmax = 0.99999, zerodist = "add")
global Multidimensional Scaling using monoMDS
Data: pa(comm.matrix)
Distance: jaccard shortest
Dimensions: 3
Stress: 0.1598281
Stress type 1, weak ties
Two convergent solutions found after 34 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘pa(comm.matrix)’
Next I ran I envfit on a corresponding environmental matrix which
contained both continuous and categorical data.
ef1 <- envfit(nmds_output, envir.matrix, permu = 999, na.rm=TRUE) #(very
long) output not shown
This as I understand it assumes a linear relationship between the
continuous variables and the metaMDS axes. This didn't seem quite right
so I used gam via ordisurf to get a better estimate and visualization of
the relationships for all of the variables individually. There was
reasonable correspondence between estimates of the strengths of the
relationships using envfit and ordisurf.
ordi<-list()
for (i in 1:ncol(envir.matrix)){ ordi[[i]]<-ordisurf(nmds_final,
envir.matrix[,i], add=FALSE) } #output not shown
Of course many of the environmental variables considered are highly
correlated and/or may have additive effects, so ideally I wanted to do
some sort of model comparison that would choose the most parsimonious
model that best describes the community using the optimal number of
predictors. A good option seemed to be bioenv. Having read the vegan
help and various tutorial as well as Clarke, K. R & Ainsworth, M. 1993 I
understand that bioenv uses a totally different approach, which as I
understand is based on the rank correlations between distance matrices
calculated from the (biotic) community data v. subsets of potential
predictors from an environmental matrix. (Hope I have that right).
#bioenv.output<-bioenv(comm=vegdist(decostand(comm.matrix, "pa"),
"jaccard"),env=envir.matrix, metric="gower", upto=7) #yes, this took a
long time
> bioenv.output
Call:
bioenv(comm = vegdist(decostand(comm.matrix, "pa"), "jaccard"), env =
envir.matrix, upto = 7, metric = "gower")
Subset of environmental variables with best correlation to community data.
Correlations: spearman
Dissimilarities: jaccard
Metric: gower
Best model has 4 parameters (max. 7 allowed):
bio_1 bio_10 bio_19 Latitude
with correlation 0.3217859
Where I'm coming unstuck is that envfit and ordisurf give me an entirely
different picture of which variables are important when compared with
the bioenv function. For example, a variable called "Bioregion" is
highly significant with respect the to the prediction of axes 1 and 2 of
the NMDS using envfit, and based on an R-squared of 0.32 is the
strongest predictor of the lot. Alternatively, "Bioregion" never makes
the top models at all when I use the bioenv function. One could argue
that maybe "Bioregion" is correlated with other variables like latitude
and that these other variables are simply better, but actually Bioregion
alone is a terrible predictor based on bioenv. I know this since I can
force bioenv to choose Bioregion (by giving it only a subset of the
environmental matrix that I know to contain really poor predictors), and
when do this I get an r value of 0.05. This is only an example --
several of the top performers according to envfit do not feature at all
using the bioenv model selection approach.
So my question really is 'how do I interpret/deal with these
discrepancies?' I know that the r values from the two approaches mean
totally different things and can't be directly compared, but one
approach leads me to think that a certain subset of the environmental
variables might be important, and the other approach appears to be
telling me something different. I must admit this is making my brain
hurt as I would expect at least moderate correspondence or agreement
among methods. I like the bioenv function as it allows competing models
and the evaluation of combinations of variables, though I must admit
that envfit/ordisurf approach is a bit more intuitive to me.
Has anyone else worked through similar issues? Am I missing something
obvious, and/or are there suggestions for a way forward?
Thanks in advance for any help!
Jeff
More information about the R-sig-ecology
mailing list