[R-sig-eco] glm-model evaluation

Kingsford Jones kingsfordjones at gmail.com
Mon Jun 2 23:10:03 CEST 2008


I was hoping that someone well versed in the theory at the interface
of statistics and machine learning would take over, but since there
were no responders I'll give it a go, relying heavily on a quick
re-reading of Ch 7 of:

@book{hastie2001esl,
  title={{The Elements of Statistical Learning: Data Mining,
Inference, and Prediction}},
  author={Hastie, T. and Tibshirani, R. and Friedman, J.},
  year={2001},
  publisher={Springer}
}

I'll make a few comments in-line below, and then discuss some of the
main issues as I understand them.  I'll try to wrap it all up so we
stay relevant to the original question.

On Fri, May 30, 2008 at 9:15 PM, David Hewitt <dhewitt37 at gmail.com> wrote:
>
> We've mostly gotten out of the area where I know enough statistically to
> speak with confidence, but I'll risk some lumps anyway...
>
> I always thought that the idea of retaining a portion of the data for
> validation was a good idea. I asked David Anderson about this personally and
> he said he couldn't see any reason to do that. Using likelihood, he thought
> the best approach was to use all the data to determine the best model.

I agree that all of the data should be used to fit the best model, but
ideally not all of it used to select the best model.

>
> I'm pretty muddy on the difference between selecting a good model with AIC
> (which is sometimes referred to as being predictive in nature) and what is
> meant by post-hoc validation of predictive ability (aside from testing on
> another data set). I've often seen the "leave-one-out" approach used to
> "validate" a model. If anyone has a good reference that differentiates the
> two with an example, I'd really appreciate it.

The leave-one-out approach is a poor choice for model assessment
because the training sets the model is fit to are nearly identical,
resulting in a high-variance estimate of prediction error.  A good
reference for these issues is the Hastie et al. book cited above.  For
a more practical S/R approach, with less focus on machine
learning/data mining and more on the classes of models commonly used
in ecology, there is some useful validation information in

@book{harrell2001rms,
  title={{Regression Modeling Strategies: With Applications to Linear
Models, Logistic Regression, and Survival Analysis}},
  author={Harrell, F.E.},
  year={2001},
  publisher={Springer}
}
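
For what it's worth, here is a minimal sketch (the data and formula
are made up) contrasting leave-one-out and 10-fold cross-validation
for a Gaussian GLM, using cv.glm from the boot package:

  library(boot)

  set.seed(1)
  dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
  dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100)

  fit <- glm(y ~ x1 + x2, data = dat)

  ## leave-one-out CV (the default, K = n): the n training sets are
  ## nearly identical
  cv.glm(dat, fit)$delta[1]

  ## 10-fold CV: the training sets overlap less, so the estimate of
  ## prediction error is typically less variable
  cv.glm(dat, fit, K = 10)$delta[1]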


On Sun, Jun 1, 2008 at 7:19 AM, Ruben Roa Ureta <rroa at udec.cl> wrote:

> I think it is a matter of principles. In my view statistical inference
> theory only covers estimation of parameters and prediction of new data
> GIVEN a model, whereas model selection requires a larger theory. The AIC
> fits very well in this view since Akaike's theorem joins statistical
> inference theory with information theory. These two theories together
> provide the tools to make model selection (or model identification, sensu
> Akaike).

I'm not sure I understand how my comments about validating a model (or
an ensemble of models) intended to have predictive ability fit into
this.

> I agree with Anderson that I would use always all my data to best fit my
> model with the likelihood. Cross-validation is ad hoc whereas the AIC is
> grounded on solid theory.

Yes, I agree that all of the data should be used in *fitting* the best
model (regardless of whether you are using a likelihood based
approach).  I do not agree that cross-validation is not grounded in
solid theory -- there is an abundance of theory, much of it developed
by statisticians (including Brad Efron, Seymour Geisser, and many
others cited in the references given above).

More generally, I think it's worth distinguishing model selection
from model assessment.  AIC, AICc, BIC, Cp, etc. are model selection
tools.  We can qualify this even more, I believe, by saying that they
are tools designed to compare relative estimated predictive ability
for (as Hastie et al. say on pg 203) "a special class of estimates
that are linear in the parameters".  All of these tools can be shown
to estimate the optimism caused by overfitting and then add that
estimate to the observed error on the training data.  Note that the
optimism is the expected difference between the in-sample prediction
error (i.e. the error conditional on the observed values of the
predictors) and the observed error on the training data.  The
cross-validation methods (including various bootstrap estimates of
prediction error), on the other hand, directly estimate the true
prediction error (not conditional on the observed values of the
predictors).  For model selection it is reasonable to estimate the
in-sample error, because it is the relative differences in error that
matter rather than their actual values; but for a general assessment
of predictive accuracy, a direct estimate of the "extra-sample error"
via cross-validation or the bootstrap is generally better.  Another
issue to keep in mind is that the information criteria are based on a
likelihood and so come with a suite of assumptions, whereas
cross-validation is non-parametric.
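
To make the distinction concrete, here is a small sketch (made-up
data) in which AIC compares two candidate GLMs on a relative scale,
and 10-fold cross-validation then gives a direct estimate of the
extra-sample error for the chosen model:

  library(boot)

  set.seed(2)
  dat <- data.frame(x = rnorm(200))
  dat$y <- rbinom(200, 1, plogis(-1 + 1.5 * dat$x))

  m1 <- glm(y ~ 1, data = dat, family = binomial)
  m2 <- glm(y ~ x, data = dat, family = binomial)

  ## model selection: relative comparison on the training data
  AIC(m1, m2)

  ## model assessment: 10-fold CV estimate of prediction error for
  ## the chosen model, with misclassification rate as the cost
  cost <- function(y, p) mean(abs(y - p) > 0.5)
  cv.glm(dat, m2, cost = cost, K = 10)$delta[1]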

Now to bring this all back to the original question.  The poster
stated that he had selected a model via AIC tables and expressed a
desire to determine how "good" the model was.  In my experience
AIC-type tables are often used by folks who don't have a good
understanding of what's going on under the hood (clearly not many
people have the time and energy it requires to really understand the
models, the probability distributions and likelihoods, the
assumptions, the connection to information theory or Bayesian priors
and posteriors and ratios thereof etc.).  A common mistake is to
assume that if a model has a "good" *IC score relative to the other
models in the list, it is a "good" model.  Ben Bolker gave some good
advice for checking how the model is doing: goodness-of-fit on the
global model, the distributions of errors within groups, linearity,
leverages, outliers, etc.  There are plenty of assumptions that come
along with the modeling process and it is up to the modeler to
demonstrate that the model meets them adequately (for some definition
of adequately).  My point was just to add that if the model is
intended to have predictive ability, there are tools out there to
assess that ability. Unfortunately there is no one-size-fits-all
algorithm for how to do this.  I mentioned that ideally there is
enough data so that, if the validation tools are going to affect the
final choice of a model, the data can be split into three groups:
training, validation, and test.  These days more and more datasets
are large enough that this luxury is feasible.  How large is large
enough is completely context-dependent, but there is almost always a
law of diminishing returns with sample size: the variance of x-bar is
sigma^2/n, so going from a sample size of 1 to 2 cuts the variance in
half, while going from 100 to 101 barely changes it.  So at some
point holding out enough independent data to get a low-variance
estimate of predictive ability (again, how you define predictive
ability is context-dependent) is the 'right' thing to do.  Even if
you don't have that luxury, for the reasons described above, using an
internal cross-validation technique such as the tools offered in
Harrell's Design package or the errorest function in ipred (a search
of R packages for 'cross-validation' will reveal others) can often
produce very helpful estimates of the predictive ability of your
model.
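
As an illustration of what that internal validation looks like, here
is a sketch assuming a hypothetical data frame habdat with a binary
presence response (the variable names are made up); validate() in the
Design package bootstraps the whole fitting process and reports
optimism-corrected indices:

  library(Design)   # lrm() and validate()

  ## x = TRUE and y = TRUE are needed so validate() can refit the
  ## model on bootstrap samples
  f <- lrm(presence ~ elev + cover, data = habdat, x = TRUE, y = TRUE)

  ## bootstrap estimates of optimism-corrected indices (Dxy, R2,
  ## Brier score, ...)
  validate(f, method = "boot", B = 200)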

All that said, I'll end by throwing in my opinion that if the goal is
prediction, rather than inference and interpretation of model
parameters, I would probably not use an AIC-type table.  Model
averaging with an AIC table helps, but there are usually better ways.
The 'right' tool depends on the type of predictions wanted, but here
are a few packages I like: gbm, mboost, nnet, randomForest, and e1071.
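
For instance, a bare-bones sketch (made-up data) of fitting a random
forest purely for prediction; note that the out-of-bag error it
reports is itself a form of internal validation:

  library(randomForest)

  set.seed(3)
  dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
  dat$y <- factor(rbinom(300, 1, plogis(dat$x1 - dat$x2)))

  rf <- randomForest(y ~ x1 + x2, data = dat)
  rf        # prints the out-of-bag estimate of the error rate
  predict(rf, newdata = data.frame(x1 = 0.5, x2 = -1))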

Also there is a task view for machine learning:
http://cran.r-project.org/web/views/MachineLearning.html

Finally, here's a fun real-world application of predictive tools
(they're getting pretty close to the US$1 million prize):

http://www.netflixprize.com/leaderboard

I was happy to see that the folks at the top are at least as much
statisticians as they are computer scientists ;-)

best,

Kingsford Jones


