[R-sig-ME] time series with autcorrelated errors - matrix not invertible

Mon Aug 22 18:31:57 CEST 2016

   My original code involved use ar() to assess the appropriate AR model 
and then fit that model using gls(). This is not the general approach 
outlined in Fox and Weisberg's appendix chapter on Time-Series 
regression. They describe a proper approach (see below). The problematic 
nature of my original approach is hinted at by Ben Bolker noting that 
ar() and gls() are "using different algorithms".

Fox and Weisberg fit gls models for various orders and then choose among 
them. The proper point is that we need to compare models with different 
orders for the dependency of the errors. My approach incorrectly based 
the choice of order on the time series itself.

There is a wrinkle in the present context that is not addressed in Fox 
and Weisberg. In my analysis, I want to assess whether a constant

gls(CV ~ 1,....correlation = corARMA(p=?))

or time-dependent model

gls(CV ~ 1 + time,....correlation = corARMA(p=?))

has more support for a given time series.

So, which model is best to use (1 or 1 + time) as the model used to 
determine the value of p? Is there an a priori reason to choose one of 
them? The two models may suggest different values of p.

In addition, suppose one starts with, say,

gls(CV ~ 1.......correlation = corARMA(p=x))

and determines that p = x generates a (constant) model with the lowest 
AIC among those considered. it is easy to find time series (available 
upon request) for which for

gls(CV ~ 1 + time......correlation = corARMA(p=x))

canNOT be fit

or vice-versa.

does the simpler model have preference as a basis for choice of p? 
suppose it does and the time model cannot be fit for the same value of p.

What p value does one choose? One possible rule is to choose the order 
for which both models can be fit but there will likely be more than one 
value of p for which this occurs. How does one choose among them?

thoughts and suggestions will be much appreciated.

many thanks!

   > I am analyzing time series of a statistic (coefficient of variation or
   > CV) generated by a Monte Carlo simulation of population dynamics. A
   > given simulation might generate thousands of such time series.
   >
   > For any given time series, the sequential estimates of CV are 
correlated
   > with one another (because they are based on partially overlapping sets
   > of the population numbers). Accordingly, I am using gls models with
   > autocorrelated errors to analyze each time series.
   >
   > Although the stochastic process generating all sample paths is the 
same,
   > any given sample path might consist of a time series of CV for which
   > AR(0), AR(1), AR(2), etc. generates the smallest AIC value, as produced
   > by ar() (see below).
   >
   > In practice, I have seen time series for which AR(0), AR(1), AR(2)
   > generate the smallest AIC value and so always just using a specific
   > order (e.g., AR(1)) does not seem appropriate. This is especially so
   > because I am developing methods that would apply to time series for
   > which the generating dynamics are not known.
   >
   > For any given sample path y, the code is
   >
   > #assess order of AR model using ar. use order with smallest AIC
   > arorder <- ar(moving_statistics$CV, order.max = 5)$order
   >
   > #evaluate model with constant and time
   > if (arorder < 1) {
   > summary(int_model <- lm(CV ~ 1, data = moving_statistics))
   > summary(time_model <- lm(CV ~ 1 + time, data = moving_statistics))
   >
   > #extract statistics for sample path y
   > pvalue_cv$pvalue[y] <- anova(int_model,time_model)$Pr
   > pvalue_cv$slope_time[y] <- time_model$coefficients[2]
   > } else {
   > summary(int_model <- gls(CV ~ 1, data = moving_statistics, 
correlation =
   > corARMA(p=arorder), method="ML"))
   > summary(time_model <- gls(CV ~ 1 + time, data = moving_statistics,
   > correlation = corARMA(p=arorder), method="ML"))
   > #extract statistics for sample path y
   > pvalue_cv$pvalue[y] <- anova(int_model,time_model)$p[2]
   > pvalue_cv$slope_time[y] <- time_model$coefficients[2]
   > }
   >
   >
   > When I run the simulation, there are some sample paths for which AR(2)
   > has the lowest AIC and so this is the model called, as in
   >
   > summary(time_model <- gls(CV ~ 1 + time, data = moving_statistics,
   > correlation = corARMA(p=2), method="ML"))
   >
   >
   > This call appears to always generate this error message
   >
   > Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) :
   >   Coefficient matrix not invertible
   >
   > Here are my questions:
   >
   > 1. how does ar generate an AIC value for p =2 even though the
   > coefficient matrix is not invertible and so the model fit fails?

     gls and ar are using different algorithms.  Furthermore, I think ar()
is assuming a constant mean whereas your gls() has a linear effect of
time, so they're actually trying to fit different models.

   > 2. Is there a way to actually fit such a model, say, by adjusting
   > tolerances?

     Don't know.  I would start by debugging my way through the gls code to
say where it is actually breaking.  I don't think I have much to say
other than what I already said in the SO thread you reference below.
   >
   > 3. a related thread
   > 
(http://stackoverflow.com/questions/6706143/error-with-gls-function-in-nlme-package-in-r)
   >
   >
   > indicates that there are too many parameters being estimated given the
   > length of the time series.
   >
   > Is there a numerical criterion that can be queried before the gls call
   > so that I can test for invertibility and fit, say, a lower order model
   > (e.g., AR(1)), so as to avoid an error?

     You could certainly wrap this attempt in a try() or tryCatch() clause
so that it wouldn't break your simulation run and you could then fall
back on AR(1) ...

   >
   > I cannot figure out from the code for gls what actually generates the
   > error message. A pointer to the specific numerical criterion that
   > generates this error would be very much appreciated.

     Unpacking the nlme source code and searching for the error message
finds it on line 557 of src/corStruct.c:

      F77_CALL(dqrdc2) (coef, &P, &P, &P,  &sqrt_eps, &i, qraux, pivot, w\
ork);
               if (i < P)
                   error(_("Coefficient matrix not invertible" ));

https://svn.r-project.org/R/trunk/src/appl/dqrdc2.f

says that dqrdc2 is a Householder transformation, but you'd really have
to figure this out for yourself ...

     the i parameter corresponds to a k argument inside the FORTRAN
function: "k contains the number of columns of x judged to be linearly
independent." so it looks like you end up with a rank-deficient matrix
at some point.
   >
   > In the stackoverflow thread, this is mentioned:
   >
   > The general rule of thumb is that you should have at least 10 times as
   > many data points as parameters, and that's for standard fixed
   > effect/regression parameters. (Generally variance structure parameters
   > such as AR parameters are even a bit harder/require a bit more data 
than
   > regression parameters to estimate.)
   >
   > Where in the literature is this rule discussed?

     I got it from Frank Harrell's _Regression Modeling Strategies_ book.

   > 4. If the only option is to fit a lower order model of error, say,
   > AR(1), instead of AR(2), what kind of bias does this generate in the
   > analysis?
   >

      Don't know.  Since you're running simulations, you could find out ...
   >
   >
   > Many thanks in advance,
   >
   > S.
   >

-- 
Steven Orzack
Fresh Pond Research Institute
173 Harvey Street
Cambridge, MA 02140
617 864-4307

www.freshpond.org