[R-sig-eco] quantifying directed dependence of environmental factors

Mon Mar 11 18:58:36 CET 2013

Hi Jay,

What you describe is similar to research conducted at UCL by myself and
colleagues and also as part of an EU project that finished a few years
ago now.

Since then, my thoughts on this have expanded a little and Sarah's point
about path analysis was where I would have gone next if I was continuing
this line of investigation.

In the work I did at UCL, I went with a time series approach. I
decomposed the species data into a set of ordination axes (I used PCA,
but on Hellinger transformed data to account for non-linear responses in
the species which PCA doesn't handle well). Then I fitted an additive
model with, say, PCA axis 1 (PC1) as the response variable and one or
more covariates entering as smooth functions as the predictors. The nice
things about the additive model is that the terms come together
additively and the non-linear effects allow the effect of a variable to
change (i.e. cold temperatures only induce an "effect" on the response).
As these were time series data I controlled for autocorrelation by using
a continuous time AR(1) process for the residuals.

Here are some refs on these approaches:

http://www.aslo.org/lo/toc/vol_54/issue_6_part_2/2529.html
http://dx.doi.org/10.1111/j.1365-2427.2012.02860.x
http://dx.doi.org/10.1111/j.1365-2427.2011.02651.x
http://dx.doi.org/10.1111/j.1365-2427.2011.02670.x

and there is more in the Special Issue section of FWBiol:

http://onlinelibrary.wiley.com/doi/10.1111/fwb.2012.57.issue-10/issuetoc

Now this doesn't look at cascades of effects; one key aspect of all of
the above was the construction of appropriate time series for the
environmental data. Ideally, I'd take the variable that is most closely
related to diatom physiology as my predictor. However those variables do
not always exist or are not available. Instead surrogates can be used;
amount of agricultural land-use in catchments or fertilizer-use
historical records will be highly correlated with nutrient loading from
a catchment to the lake/reservoir, so these could be used instead. Of
course, you do need to wary of spurious correlations with time series
data.

As these models are using smooth functions, you are only going to be
able to include one or a few covariates unless you have *lots* of
samples; and anyway, I would always advise to think first from the
biological or ecological viewpoint and formulate an hypothesis there and
then fit that with the stats rather than throwing lots of variables into
an analysis to see what pops out (which seems to be what a lot of
palaeoecologists do!)

As regards envfit(), it isn't symmetric in the variables at least as far
as I see it; it fits a model of

varZ = \beta_1 axis_1 + \beta_2 axis_2 + \varepsilon

in other words it uses the 2d "axis" scores (PC1 and PC2, or nMDS1 +
nMDS2 coordinates) to predict the values of the response using a linear
model. As each environmental variable is modelled separately
(individually), one is not favouring a set of variables etc. Perhaps
this is not what you meant but worth pointing out.

Also, envfit presumes a linear relationship between the variables and
the ordination coordinates. If that is too strong an assumption, see
ordisurf() which fits GAM-based surfaces rather than linear-regression
surfaces.

A fairly standard way of looking at this sort of data might be to group
variables that are related and then decompose the variance in the data
in that which can be explained by each group of variables uniquely, that
which can be explained by two or more groups, and the unexplained
variance. The vegan package has a function varpart() which can do this
all for you if you are willing to use RDA to analyse the data (unbiased
estimates of the variance explained are not available for CCA and nMDS
is not a constrained technique) - note you can use principal coordinates
analysis to embed your original dissimilarity matrix into a metric space
and then take the PCoA axis scores as the input "data" for the RDA so
that the RDA is in the dissimilarity data of your choice and not linear
in the original data.

Steve Juggins has adapted the hierarchical partitioning approach from
package hier.part to the multivariate multiple regression setting of RDA
(possibly CCA too?) which is related to but somewhat different to the
variance partitioning described above. I don't believe Steve has
released this code yet, so if interested I'd emailing him for it; he is
the author of the rioja package so contact details can be found on CRAN.

Neither variance partitioning or hierarchical partitioning directly do
exactly what you ask and model the directed dependence or pathways of
effects. They are however far simpler methods which would have
familiarity within the applied community that will see these
results/papers etc.

In writing this I have pondered on whether you and/or the ecologists are
making it too complex? As you have all the variables of interest, I
might model the variables that physiologically affect the diatoms and
their effect on diatom composition (using constrained ordination or the
additive model time series approach I used). Then I would model the
relationship between the higher-level factor (land-use, etc) that I
hypothesise to be driving changes in the lower-level,
physiologically-relevant, variable.

Whilst path analysis might be ideal tool here, I suspect you'll have
plenty of complications to address, not least the compositional aspects
(though Sarah points you to some ideas there), but also the temporal
autocorrelation prevalent in the palaeo data which would need to be
accounted for to yield appropriate p-values. If you do address these
issues and use path analysis, I for one would be very interested to here
how you got on as this would be a very useful contribution to the
palaeolimnological literature!

Wow, this got long! Anyway, hopefully some of the above will be of use
and interest, and should you wish to chat about this off-list some more
(I'm certainly very interested if you use path analysis for this), then
please do so. (Note I am still sending via my old UCL address as I have
moved to Canada and U Regina, contact details in my email signature.)

All the best,

Gavin

On Wed, 2013-03-06 at 22:12 -0500, Jay Kerns wrote:
> Hello,
> 
> I'm posting to this list because I believe it's the best place to
> go.  My question is R related only inasmuch as all the work I've
> done so far has been with R and I expect any answers I get from
> here will lead me to more R work.
> 
> I'm consulting with an ecologist and an engineer on a project
> related to a reservoir nearby.  They've collected data on diatoms
> in the reservoir via core samples; they have sections of data over
> the past 100yrs.  They are looking at the community structure
> plus other environmental factors over the same time period.
> 
> We've done a ton of work already and there's no point trying to
> hash all of that out here.  Short story: we did an NMDS, it fits
> OK (stress 0.17), there are obvious clusters in the ordination
> which correspond to a-priori clusters from ecological
> considerations (and which match an independent cluster analysis),
> we're really quite pleased overall.  We checked for relationships
> with =envfit=, most environmental variables are *highly*
> significant, yet there are a couple which aren't significant at
> all.  Here comes my question:
> 
> The ecologist pointed out to me that our environmental variables
> don't have equal status (ecologically speaking); some variables
> lead to others.  For instance, there are so-called ultimate
> factors (population, percentage farmland) which contribute to
> intermediate factors (suspended solids, total phosphorous) which
> in turn contribute to direct factors (AREA, pH,...) which then in
> turn contribute to diatom structure.
> 
> We have measured data on all the above and several more.  The
> model we are fitting with =envfit= is symmetric in those n
> environmental variables, but the ecology of the situation isn't
> symmetric, it's a directed top-down kind of relationship.  He
> asked me, "How can we quantify that?  How can we demonstrate
> that?  Can we quantify/demonstrate that?"  I don't know.
> 
> There are ecologists on this list: what am I looking for, here?
> What methods do ecologists use to answer this (or related)
> question(s)?  Feel free to direct me to papers, literature,
> textbooks, whatever.  I'm trying to help answer this question
> and (this not being my subject specialty) I'm at a bit of a loss.
> 
> If there are relevant R packages/vignettes/manuals you can point
> me to, that'd be cool too.
> 
> Thanks for reading all the way down to here.
> 
> Jay
> 
> P.S. If it hadn't been for the archives of this list containing
> lengthy and poignant answers to *several* questions I've had
> already then I couldn't even have made it this far.  Thank you!
> 
> 
> 

-- 
Gavin Simpson, PhD                          [t] +1 306 337 8863
Institute of Environmental Change & Society [f] +1 306 337 2410
523 Research and Innovation Centre          [e] gavin.simpson at uregina.ca
University of Regina                        [tw] @ucfagls
Regina, SK S4S 0A2, Canada