[R] GAM: Overfitting
Simon Wood
simon at stats.gla.ac.uk
Wed Dec 22 12:08:17 CET 2004
> I am analyzing particulate matter data (PM10) on a small data set (147
> observations). I fitted a semi-parametric model and am worried about
> overfitting. How can one check for model fit in GAM?
- Keeping a random subset of the data as a validation set, fitting
to the remaining data and then comparing the R^2 / proportion of deviance
explained on the fit set and on the validation set is usually quite
diagnostic. If the fit data are much better predicted than the validation
data, then you probably have over-fitting.
- If your response is treated as Poisson then scale parameter estimates
<<1 are also diagnostic, but only if you are not expecting overdispersion,
of course.
- If you use gam from package mgcv then, by default, model
effective degrees of freedom are estimated from your data by GCV or an
approximation to AIC. mgcv::gam allows you to increase the penalty on each
model degree of freedom in these criteria, via gam argument `gamma'. Some
work by Kim and Gu (2004, J.Roy.Statist.Soc.B) suggests that gamma around
1.4 can be a sensible choice for suppressing overfitting, without
much of a degradation in MSE performance.
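
A minimal sketch of the three checks above, on simulated data rather than
the PM10 data. Everything other than mgcv::gam and its `gamma' argument is an
assumption made for illustration: the simulated covariates x1 and x2, the
hold-out size of 40, the helper names dev_expl, fit_gam and report, and the
use of family = quasipoisson as one way to obtain a scale estimate for count
data.

```r
library(mgcv)  # assumes the mgcv package is installed

set.seed(1)
n <- 147                                   # same size as the data set in the question
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- rpois(n, exp(1 + sin(2 * pi * dat$x1)))  # x2 has no real effect

# Keep a random subset aside as a validation set
val_idx <- sample(n, 40)
fit_dat <- dat[-val_idx, ]
val_dat <- dat[val_idx, ]

# Proportion of deviance explained on an arbitrary data set,
# relative to a constant-mean null model with mean null_mu
dev_expl <- function(model, data, null_mu) {
  fam <- family(model)
  wt  <- rep(1, nrow(data))
  mu  <- as.numeric(predict(model, newdata = data, type = "response"))
  dev      <- sum(fam$dev.resids(data$y, mu, wt))
  null_dev <- sum(fam$dev.resids(data$y, rep(null_mu, nrow(data)), wt))
  1 - dev / null_dev
}

# Same model fitted with the default penalty (gamma = 1) and with gamma = 1.4
fit_gam <- function(g) {
  gam(y ~ s(x1) + s(x2), family = quasipoisson, data = fit_dat, gamma = g)
}
m_def <- fit_gam(1)
m_pen <- fit_gam(1.4)

# Total effective degrees of freedom, scale estimate, and deviance explained
# on the fit set and on the validation set
report <- function(m) {
  null_mu <- mean(fit_dat$y)
  c(edf        = sum(m$edf),
    scale      = summary(m)$scale,
    fit        = dev_expl(m, fit_dat, null_mu),
    validation = dev_expl(m, val_dat, null_mu))
}
res <- rbind(default = report(m_def), gamma_1.4 = report(m_pen))
print(round(res, 3))
```

A large gap between the `fit' and `validation' columns, or a scale estimate
well below 1 when no underdispersion is expected, would point towards
over-fitting; comparing the two rows shows how much the heavier penalty
changes the effective degrees of freedom.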
best,
Simon