[R] GAM model selection and dropping terms based on GCV
Simon Wood
s.wood at bath.ac.uk
Mon Dec 4 15:14:45 CET 2006
On Monday 04 December 2006 12:30, aditya gangadharan wrote:
> Hello,
> I have a question regarding model selection and dropping of terms for GAMs
> fitted with package mgcv. I am following the approach suggested in Wood
> (2001), Wood and Augustin (2002).
>
> I fitted a saturated model, and I find from the plots that for two of the
> covariates, 1. The confidence interval includes 0 almost everywhere
> 2. The degrees of freedom are NOT close to 1
> 3. The partial residuals from plot.gam don't show much pattern visually (to me)
> 4. When I drop either or both of the terms, the GCV score increases;
>
> This is my main problem: how much of an increase in GCV is ‘acceptable’
> when terms are dropped? In the above case, the delta GCV scores are .03,
> .06 and .11 when I drop covariate A, covariate B and both respectively from
> the full model. I would be very grateful for any advice on this.
- I'm not sure that there is really an answer to this. GCV is based on
minimizing some approximation to the expected prediction error of the model.
So to answer the question you'd need to do something like decide how much
increase from `optimal' prediction error you would be prepared to tolerate.
I think that it's not all that easy to come up with a nice way of blending
prediction error based approaches to model selection, with approaches based
on finding a model that is somehow the simplest model consistent with the
data (but perhaps other people will comment on this).
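As a concrete illustration of the sort of comparison being discussed, here is a hedged sketch (simulated data; the covariate names x0 and x1 are made up) of extracting the GCV scores of a full and a reduced fit from `mgcv::gam' and looking at the delta:

```r
library(mgcv)  # ships with R as a recommended package

set.seed(2)
n <- 200
dat <- data.frame(x0 = runif(n), x1 = runif(n))
dat$y <- sin(2 * pi * dat$x0) + rnorm(n, sd = 0.3)  # x1 is pure noise here

## Full model with both smooths, and reduced model dropping s(x1)
full    <- gam(y ~ s(x0) + s(x1), data = dat)
reduced <- gam(y ~ s(x0), data = dat)

## gcv.ubre holds the minimised GCV (or UBRE) score of each fit;
## the difference is the 'delta GCV' of dropping the term
delta.gcv <- reduced$gcv.ubre - full$gcv.ubre
```

Note that the delta can even be negative when a term contributes nothing, since the reduced model has fewer effective degrees of freedom to pay for.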
- That said, there is certainly an issue relating to the fact that the GCV
score (or AIC, in fact) is rather asymmetric, so that random variability in
the score tends to lead more readily to overfitting than to underfitting.
This suggests that in fact prediction error performance at finite sample
sizes may be improved by shrinking the smoothing parameters themselves. With
`mgcv::gam' you can do this by increasing the `gamma' parameter above its
default value, which favours smoother models by making each model degree of
freedom count as gamma degrees of freedom in the GCV score (or AIC/UBRE). It
is possible to choose `gamma' by e.g. 10-fold cross-validation, but that
requires some coding.
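The "some coding" for choosing `gamma' by 10-fold cross-validation might look roughly like the following hedged sketch (simulated data; the candidate gamma grid and fold count are arbitrary choices, not recommendations):

```r
library(mgcv)  # ships with R as a recommended package

set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

## Candidate gamma values: 1 is the default; larger values make each
## effective degree of freedom count for more in the GCV score
gammas <- c(1, 1.2, 1.4, 1.6)

## Assign each observation to one of 10 folds at random
folds <- sample(rep(1:10, length.out = n))

## For each gamma, average the held-out prediction MSE over folds
cv.err <- sapply(gammas, function(g) {
  mean(sapply(1:10, function(k) {
    fit <- gam(y ~ s(x), data = dat[folds != k, ], gamma = g)
    mean((dat$y[folds == k] - predict(fit, dat[folds == k, ]))^2)
  }))
})

best.gamma <- gammas[which.min(cv.err)]
```

The final model would then be refitted on all the data with `gamma = best.gamma'.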
- There are more discussions of GAM model selection in various mgcv help files
and my book. See help("mgcv-package") for details of which pages, and the
reference.
My bottom line on model selection is to use things like GCV, AIC, confidence
interval coverage and approximate p-values for guidance, but not as the basis
for rules... modelling context has to play a part as well.
Sorry if that's all a bit vague.
Simon
--
Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
+44 1225 386603 www.maths.bath.ac.uk/~sw283