[R-sig-eco] Comparison and interpretation of results from envfit, ordisurf and bioenv in the vegan package

Tue Jan 6 08:26:01 CET 2015

Dear forum,

My first time posting to this forum -- afraid I am going to start with a 
somewhat conceptual/complicated question.

My question concerns the analysis of environmental factors as predictors 
of community (using vegan). Specifically I am trying to understand why 
envit and ordisurf appear to be giving me a qualitatively different 
picture of environmental correlations with my community matrix when 
compared with the bioenv function/approach.

Briefly:

I am using NMDS to reduce dimensionality of a large dataset that of soil 
microbes across broad geographic scale (there are over 250 sites and 60 
species in the matrix). I got the metaMDS in vegan to converge on a 
solution despite the high number of site pairs that shared no species 
(~30%) by using the noshare=0.1 modifier.

Call:
metaMDS(comm = decostand(comm.matrix, "pa"), distance = "jaccard", k = 
3, trymax = 150, engine = "monoMDS", noshare = 0.1, weakties = TRUE, 
stress = 1, maxit = 500, scaling = TRUE, pc = TRUE, smin = 1e-04, 
sfgrmin = 1e-07, sratmax = 0.99999, zerodist = "add")

global Multidimensional Scaling using monoMDS

Data: pa(comm.matrix)
Distance: jaccard shortest

Dimensions: 3
Stress: 0.1598281
Stress type 1, weak ties
Two convergent solutions found after 34 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘pa(comm.matrix)’

Next I ran I envfit on a corresponding environmental matrix which 
contained both continuous and categorical data.

ef1 <- envfit(nmds_output, envir.matrix, permu = 999, na.rm=TRUE) #(very 
long) output not shown

This as I understand it assumes a linear relationship between the 
continuous variables and the metaMDS axes. This didn't seem quite right 
so I used gam via ordisurf to get a better estimate and visualization of 
the relationships for all of the variables individually. There was 
reasonable correspondence between estimates of the strengths of the 
relationships using envfit and ordisurf.

ordi<-list()
for (i in 1:ncol(envir.matrix)){ ordi[[i]]<-ordisurf(nmds_final, 
envir.matrix[,i], add=FALSE) } #output not shown

Of course many of the environmental variables considered are highly 
correlated and/or may have additive effects, so ideally I wanted to do 
some sort of model comparison that would choose the most parsimonious 
model that best describes the community using the optimal number of 
predictors. A good option seemed to be bioenv. Having read the vegan 
help and various tutorial as well as Clarke, K. R & Ainsworth, M. 1993 I 
understand that bioenv uses a totally different approach, which as I 
understand is based on the rank correlations between distance matrices 
calculated from the (biotic) community data v. subsets of potential 
predictors from an environmental matrix. (Hope I have that right).

#bioenv.output<-bioenv(comm=vegdist(decostand(comm.matrix, "pa"), 
"jaccard"),env=envir.matrix, metric="gower", upto=7) #yes, this took a 
long time

 > bioenv.output

Call:
bioenv(comm = vegdist(decostand(comm.matrix, "pa"), "jaccard"), env = 
envir.matrix, upto = 7, metric = "gower")

Subset of environmental variables with best correlation to community data.

Correlations: spearman
Dissimilarities: jaccard
Metric: gower

Best model has 4 parameters (max. 7 allowed):
bio_1 bio_10 bio_19 Latitude
with correlation 0.3217859

Where I'm coming unstuck is that envfit and ordisurf give me an entirely 
different picture of which variables are important when compared with 
the bioenv function. For example, a variable called "Bioregion" is 
highly significant with respect the to the prediction of axes 1 and 2 of 
the NMDS using envfit, and based on an R-squared of 0.32 is the 
strongest predictor of the lot. Alternatively, "Bioregion" never makes 
the top models at all when I use the bioenv function. One could argue 
that maybe "Bioregion" is correlated with other variables like latitude 
and that these other variables are simply better, but actually Bioregion 
alone is a terrible predictor based on bioenv. I know this since I can 
force bioenv to choose Bioregion (by giving it only a subset of the 
environmental matrix that I know to contain really poor predictors), and 
when do this I get an r value of 0.05. This is only an example -- 
several of the top performers according to envfit do not feature at all 
using the bioenv model selection approach.

So my question really is 'how do I interpret/deal with these 
discrepancies?' I know that the r values from the two approaches mean 
totally different things and can't be directly compared, but one 
approach leads me to think that a certain subset of the environmental 
variables might be important, and the other approach appears to be 
telling me something different. I must admit this is making my brain 
hurt as I would expect at least moderate correspondence or agreement 
among methods. I like the bioenv function as it allows competing models 
and the evaluation of combinations of variables, though I must admit 
that envfit/ordisurf approach is a bit more intuitive to me.

Has anyone else worked through similar issues? Am I missing something 
obvious, and/or are there suggestions for a way forward?

Thanks in advance for any help!

Jeff