[R-SIG-Finance] Perfect out-of-sample-fit in a model containing a lagged dependent variable?

Thu Oct 15 10:12:57 CEST 2009

Hello there!
I'm new to quantitative finance and now experimenting with the various 
tools. While playing with day-to-day predictions for the returns on 
closing call of the DAX, I observed a really strange behavior of my 
regression models: an R-squared of approx. 1, which is replicated by a 
nearly perfect fit of out-of-sample predictions over a prediction 
horizon of 57 trading days. (!) (Model details at the bottom of this mail.)

I think the issue is connected to the lagged dependent variable 
included as a predictor in the model. (However, the Durbin-Watson test 
indicates no autocorrelation in the series, which itself implies 
misspecification.) Excluding this term leads to models with an R-squared 
of approx. 0.4, which is not satisfying, but fits my expectations given 
the ad-hoc model. This is also replicated in terms of out-of-sample fit.

However, there remains the question of the nearly perfect out-of-sample 
fit for the model including the AR1-term. Has anybody experienced 
similar behavior? Answers would really be appreciated!

Kind regards,
Gero

#

Model details:

- Model setup: linear model:  close.DAX ~ lag(close.DAX) + 
lag(close.NYSE) + lag(close.HangSeng)
- Data is the respective returns (backshifted index data)
- Datasource: Yahoo-Finance
- Training-Dataset: 4000 days back - without the last 57 trading days
- Test-Dataset: the last 57 trading days
- Max. correlation of the model-variables: 0.59
- Augmented Dickey-Fuller-Test indicates stationarity
- Durbin-Watson-Test indicates no autocorrelation (! - contrary to 
model structure)
- CumSum-Test indicates no structural change between training- and test-data
- In the fitted linear model, the AR1-term (lag(close.DAX)) dominates 
the other parameters vastly and seemingly channels all the 
intraday-correlation between the independent variables; R^2 is 1. 
Out-of-sample fit is close to perfect.
- Residuals don't really look normally distributed but rather 
generalized-Pareto (extreme value) distributed, as mean residual life 
plots indicate
- Bootstrapping the model yields many NA (not available) regression 
coefficients, probably due to shrinking variance in the 
bootstrap-sample. (But this is also an issue with the model excluding 
the AR1-term.)
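
A minimal sketch of the setup described above, for anyone who wants to
reproduce it. Everything in it is an assumption: the data are simulated
stand-ins for the Yahoo-Finance returns, the names ret.DAX, ret.NYSE and
ret.HangSeng are hypothetical, and the series are assumed to be base R
"ts" objects, which the details above do not state. Under that
assumption, one thing that may be worth checking is whether lag()
actually shifts the data inside lm(): for a "ts" object it only moves
the time index, so the regressor can end up identical to the response.
The sketch compares that with a lag built by hand.

```r
# Simulated stand-ins for daily returns (assumption: not the real data).
set.seed(1)
n <- 500
ret.NYSE     <- rnorm(n, sd = 0.01)
ret.HangSeng <- rnorm(n, sd = 0.01)
ret.DAX      <- 0.5 * c(0, head(ret.NYSE, -1)) + rnorm(n, sd = 0.01)

# Plain R-squared helper, computed from residuals.
r2 <- function(y, e) 1 - sum(e^2) / sum((y - mean(y))^2)

# (a) lag() on a "ts" object: the values stay in place, only the time
#     attribute changes. lm() ignores that attribute, so this regresses
#     the series on itself.
dax.ts <- ts(ret.DAX)
same.values <- identical(as.numeric(lag(dax.ts, -1)), as.numeric(dax.ts))
fit.ts <- lm(dax.ts ~ lag(dax.ts, -1))
r2.ts  <- r2(as.numeric(dax.ts), as.numeric(resid(fit.ts)))

# (b) Lag built by hand: row t holds today's DAX return and
#     yesterday's values of all three series.
d <- data.frame(y  = ret.DAX[-1],
                x1 = ret.DAX[-n],
                x2 = ret.NYSE[-n],
                x3 = ret.HangSeng[-n])

# Hold out the last 57 observations as the test set.
h     <- 57
train <- head(d, -h)
test  <- tail(d, h)
fit   <- lm(y ~ x1 + x2 + x3, data = train)
r2.in <- summary(fit)$r.squared

# Out-of-sample R-squared, relative to the training mean as benchmark.
pred   <- predict(fit, newdata = test)
r2.out <- 1 - sum((test$y - pred)^2) / sum((test$y - mean(train$y))^2)

print(c(same.values = same.values, r2.ts = r2.ts,
        r2.in = r2.in, r2.out = r2.out))
```

If same.values is TRUE and r2.ts is essentially 1 while r2.in is far
lower, the perfect fit in (a) comes from the alignment of the lag and
not from the data. Whether this applies to the model above depends on
how the lagged series were actually constructed.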