[R] GAM model selection and dropping terms based on GCV
Simon Wood
s.wood at bath.ac.uk
Mon Dec 4 15:14:45 CET 2006
On Monday 04 December 2006 12:30, aditya gangadharan wrote:
> Hello,
> I have a question regarding model selection and dropping of terms for GAMs
> fitted with package mgcv. I am following the approach suggested in Wood
> (2001), Wood and Augustin (2002).
>
> I fitted a saturated model, and I find from the plots that for two of the
> covariates, 1. The confidence interval includes 0 almost everywhere
> 2. The degrees of freedom are NOT close to 1
> 3. The partial residuals from plot.gam don't show much pattern visually (to me)
> 4. When I drop either or both of the terms, the GCV score increases;
>
> This is my main problem: how much of an increase in GCV is ‘acceptable’
> when terms are dropped? In the above case, the delta GCV scores are .03,
> .06 and .11 when I drop covariate A, covariate B and both respectively from
> the full model. I would be very grateful for any advice on this.
- I'm not sure that there is really an answer to this. GCV is based on
minimizing some approximation to the expected prediction error of the model.
So to answer the question you'd need to do something like decide how much
increase from `optimal' prediction error you would be prepared to tolerate.
I think that it's not all that easy to come up with a nice way of blending
prediction error based approaches to model selection, with approaches based
on finding a model that is somehow the simplest model consistent with the
data (but perhaps other people will comment on this).
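As a concrete illustration of the sort of comparison being discussed, here is a hedged sketch (simulated data; the covariate names x0 and x1 are made up) of extracting the GCV scores of a full and a reduced fit from `mgcv::gam' and looking at the delta:

```r
library(mgcv)  # ships with R as a recommended package

set.seed(2)
n <- 200
dat <- data.frame(x0 = runif(n), x1 = runif(n))
dat$y <- sin(2 * pi * dat$x0) + rnorm(n, sd = 0.3)  # x1 is pure noise here

## Full model with both smooths, and reduced model dropping s(x1)
full    <- gam(y ~ s(x0) + s(x1), data = dat)
reduced <- gam(y ~ s(x0), data = dat)

## gcv.ubre holds the minimised GCV (or UBRE) score of each fit;
## the difference is the 'delta GCV' of dropping the term
delta.gcv <- reduced$gcv.ubre - full$gcv.ubre
```

Note that the delta can even be negative when a term contributes nothing, since the reduced model has fewer effective degrees of freedom to pay for.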
- That said, there is certainly an issue relating to the fact that the GCV
score (or AIC, in fact) is rather asymmetric, so that random variability in
the score tends to lead more readily to overfitting than to underfitting.
This suggests that in fact prediction error performance at finite sample
sizes may be improved by shrinking the smoothing parameters themselves. With
`mgcv::gam' you can do this by increasing the `gamma' parameter above its
default value, which favours smoother models by making each model degree of
freedom count as gamma degrees of freedom in the GCV score (or AIC/UBRE). It
is possible to choose `gamma' by e.g. 10-fold cross-validation, but that
requires some coding.
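The "some coding" for choosing `gamma' by 10-fold cross-validation might look roughly like the following hedged sketch (simulated data; the candidate gamma grid and fold count are arbitrary choices, not recommendations):

```r
library(mgcv)  # ships with R as a recommended package

set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

## Candidate gamma values: 1 is the default; larger values make each
## effective degree of freedom count for more in the GCV score
gammas <- c(1, 1.2, 1.4, 1.6)

## Assign each observation to one of 10 folds at random
folds <- sample(rep(1:10, length.out = n))

## For each gamma, average the held-out prediction MSE over folds
cv.err <- sapply(gammas, function(g) {
  mean(sapply(1:10, function(k) {
    fit <- gam(y ~ s(x), data = dat[folds != k, ], gamma = g)
    mean((dat$y[folds == k] - predict(fit, dat[folds == k, ]))^2)
  }))
})

best.gamma <- gammas[which.min(cv.err)]
```

The final model would then be refitted on all the data with `gamma = best.gamma'.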
- There are more discussions of GAM model selection in various mgcv help files
and my book. See help("mgcv-package") for details of which pages, and the
reference.
My bottom line on model selection is to use things like GCV, AIC, confidence
interval coverage and approximate p-values for guidance, but not as the basis
for rules... modelling context has to play a part as well.
Sorry if that's all a bit vague.
Simon
--
Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
+44 1225 386603 www.maths.bath.ac.uk/~sw283