[R-sig-eco] Autocorrelation structures and simplicity of analysis

Henkelman, Jonathan jonathan.henkelman at usask.ca
Wed Jul 17 18:27:35 CEST 2013


Philip,

I agree that the simpler analysis has its merits. In fact, this is where we started our analysis, and these are the results we present in the current draft of the paper.  It is reassuring to know that they are both statistically valid and conservative; we could leave the analysis there.  However, I think this form of analysis is limited (as discussed by Zuur, pg. 105, and others), and we have had trouble publishing results like this before.

I believe it is better to model the data directly rather than a regression parameter: given current computing power, there is nothing stopping us except our own statistical knowledge and understanding.  More importantly, grouping the data into, say, seasonal means throws away information that could be significant in our analysis.  That is, our mixed-effects model would have no way of knowing whether we are using a single temperature reading taken on a single date, or whether our datapoint is the result of many readings (in our case 92 days x 96 fifteen-minute readings per day = 8832 readings per plot).  Surely the latter is a more reliable measure of the summer temperature?  How can we capture this extra certainty in our final model?  Our current dataset has a strong enough signal-to-noise ratio to get a result this way, but we will extend this analysis over several years and to other parameters that may not have such a clear signal.  Also, this is a longstanding problem in our temperature data, and previous datasets have had very poor signal-to-noise ratios indeed; we need to extract every bit of information from those, as our current (conservative) model estimates are not significant.
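(For what it's worth, if we did collapse to seasonal means, I suppose the extra certainty could be carried as weights, with the variance of each mean proportional to 1/n.  A rough sketch only, with all data frame and column names hypothetical:

    library(nlme)
    ## Each row of plot_means is a seasonal mean of n_readings readings,
    ## so give it a variance proportional to 1/n_readings.
    plot_means$inv_n <- 1 / plot_means$n_readings
    mod.wt <- lme(mean_temp ~ treatment,
                  random  = ~ 1 | plot,
                  weights = varFixed(~ inv_n),
                  data    = plot_means)

But that only patches one of the problems discussed below.)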

Also, if we work with modelled regression parameters we cannot pool _all_ our information into a single analysis; that is, we cannot account for everything at once, allowing each parameter to interact with the others in a continuous way.  We can compute the mean temperature per plot, then from that model the treatment effect (and perhaps a plot random effect), but then we can't speak about overall variance.  We could compute the variance in each plot, but then we cannot speak about the mean effect or plot effect (and what is the mean of a variance anyway?).  We could subtract the mean effect off first, but then we are merely walking down the path of doing the analysis we _really_ want the hard way.  I also believe we get onto shaky ground as to whether these parameters might interact, or pull significance from each other.  I don't profess to understand the subtleties of the math well enough to determine this a priori.
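To make that concrete, the single pooled model I have in mind looks roughly like this (a sketch only; `temps`, `day`, etc. are placeholder names for our data):

    library(nlme)
    ## Everything at once: treatment as a fixed effect, plot as a random
    ## effect, and AR1 errors within each plot's time series.
    mod.all <- lme(temp ~ treatment,
                   random      = ~ 1 | plot,
                   correlation = corAR1(form = ~ day | plot),
                   data        = temps)
    summary(mod.all)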

Another limitation of analysing regression parameters is that we have not accounted for the autocorrelation in the time-series data.  If we look at the response residuals (raw residuals, i.e. the ones that do not remove the AR1 signal) from the mod.gls model, we get fairly large variances, and each plot mean looks approximately the same -- these are the results we get from a simple computation of the mean or sd of each time series.  The problem is that the seasonal fluctuation (which is random, and autocorrelated) overwhelms the sd calculation and masks the actual plot-level response.  That is, the variance over the course of the season is bigger than the plot variance at any given time snapshot.  We could model the seasonal fluctuation using a GAM (I have done this), but a) it will be different for each year, and b) it really isn't a parameter of interest anyway.  That is, I am not trying to predict what the temperature in a plot will be given treatment _and_ time of year, just treatment, because each year is going to be different.  (For the record, there is no apparent long-term seasonal trend, linear or otherwise.)  Hence we DO need to consider the autocorrelation effect, because even simple computations such as the mean and variance assume independence!  In this case an AR1 model (or the ARMA(2,0) model I have running) leaves cleaner residuals than the GAM, as well as being more appropriate in a philosophical sense.
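For the record, this is roughly how I have been comparing residuals (mod.gls as in my earlier post; I am assuming `day` indexes the readings within each plot):

    ## Raw (response) residuals keep the AR1 signal; normalized residuals
    ## remove it, so the two ACF plots show the difference directly.
    plot(ACF(mod.gls, resType = "response"),   alpha = 0.05)
    plot(ACF(mod.gls, resType = "normalized"), alpha = 0.05)

    ## The ARMA(2,0) alternative mentioned above:
    mod.arma <- update(mod.gls,
                       correlation = corARMA(form = ~ day | plot, p = 2, q = 0))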

Regarding your discussion of the variance structure: I agree that there is a mixed model that includes both AR1 and compound-symmetric components (a fully ME model with intercept and slope terms).  I need to spend more time thinking about this, as my intuition had proposed a model that looked more like a product than a sum; however, I'm thinking you might be right that the covariance structure is additive.  Zuur discusses the correlation structure implied by a full random model (pg. 113, though not in an easy-to-picture form).  He does not discuss a slope-only model, but when I look at his example and work the math through myself, I do not see the correlation decreasing with the "distance" between readings, i.e. there is no covariance term on the readings.
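Working it through for the random-intercept (intercept-only) case with AR1 errors, the implied within-plot correlation at lag h does come out as a sum, not a product:

    ## Marginal within-plot correlation at lag h for a random intercept
    ## (variance sig2_b) plus AR1 errors (variance sig2_e, parameter rho):
    ##   (sig2_b + sig2_e * rho^h) / (sig2_b + sig2_e)
    ## It decays with lag but levels off at sig2_b / (sig2_b + sig2_e) > 0.
    imp.cor <- function(h, sig2_b, sig2_e, rho)
        (sig2_b + sig2_e * rho^h) / (sig2_b + sig2_e)
    imp.cor(0:10, sig2_b = 1, sig2_e = 4, rho = 0.8)

which matches your description of AR1 + compound symmetry, at least for the intercept-only model.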

So, I'm not sure how to model this.  Do you have suggestions?  Currently the two-stage analysis I present (final code block, discussion question 3) comes closest to the final product I would like, and may in the end be valid.  Do you have thoughts on the validity here?  Functionally it models the treatment effect and AR1 process first, removes those, then computes the plot effect afterwards.
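In code, the two-stage version is roughly (sketch only; names as before):

    ## Stage 1: treatment effect plus AR1 process.
    stage1 <- gls(temp ~ treatment,
                  correlation = corAR1(form = ~ day | plot),
                  data        = temps)
    ## Stage 2: plot effect modelled on the stage-one normalized residuals.
    temps$res1 <- resid(stage1, type = "normalized")
    stage2 <- lme(res1 ~ 1, random = ~ 1 | plot, data = temps)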

Thanks for your help,
Jonathan

________________________________________
From: Dixon, Philip M [STAT] [pdixon at iastate.edu]
Sent: July 16, 2013 8:34 AM
To: r-sig-ecology at r-project.org
Subject: [R-sig-eco] Autocorrelation structures and simplicity of analysis

Jonathan,

The AR(1) + compound symmetry model does make sense.  It is a model where the correlation over time declines with the separation between time points, but it does not decline to zero.  There is a constant non-zero (at least potentially) correlation arising from the compound-symmetry part of the model (a constant correlation between all pairs of observations on the same plot).  I have found it fits many data sets better than either AR(1) only or CS only.

However, I suggest you completely rethink the analysis.  I am a great fan of the approach recommended by Paul Murtaugh a few years ago.  The paper is Murtaugh, P.A. (2007) "Simplicity and complexity in ecological data analysis." Ecology 88(1): 56-62.

Your situation is almost identical to one he considers in his paper.  His advice: remember that you have experimental units randomly assigned to warming treatments.  Your analysis currently ignores time, except for accounting for the autocorrelation.  You seem to care only about the treatment effect.  If that's the case, you don't need to worry about days.  Just focus the analysis on plots and the treatment effect.  Specifically, figure out an informative summary statistic for each plot (e.g., average over time, median over time), calculate those 12 numbers, then analyze those.  No need to worry about the autocorrelation structure then!
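In code this is nothing more than the following (a sketch; the data frame and column names are whatever you have):

    ## One summary value per plot, then an ordinary one-way analysis
    ## on those 12 numbers.
    plot.means <- aggregate(temp ~ plot + treatment, data = temps, FUN = mean)
    summary(aov(temp ~ treatment, data = plot.means))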

The beauty of this approach is that if you want to ask a different question, e.g. about day-to-day variability instead of the average, you just change your summary statistic to something that measures variability, e.g. the log sd.
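For example (again a sketch, continuing the names above):

    ## Same analysis, different summary statistic; the `temp` column of
    ## plot.sds now holds the per-plot sd rather than the mean.
    plot.sds <- aggregate(temp ~ plot + treatment, data = temps, FUN = sd)
    summary(aov(log(temp) ~ treatment, data = plot.sds))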

Best wishes,
Philip Dixon


