[R] Possible overfitting of a GAM

Bill.Venables at csiro.au
Sun Feb 17 02:39:18 CET 2008


Thomas L Jones asks:

> The subject is a Generalized Additive Model. Experts caution us
> against overfitting the data, which can cause inaccurate results. 

Inaccurate *predictions*, to be more precise.  The main problem with
overfitting is that your model captures too much of the noise in the
data along with the signal, and that noise then turns up as prediction
error.  The defining feature of randomness is not the absence of
pattern.  Randomness can sometimes appear as a fairly striking
pattern.  The problem is that next time it's a different pattern.

> I am not a statistician (my background is in Computer
> Science). Perhaps some kind soul would take a look and vet the model
> for overfitting the data.

You haven't given us very much to go on: just plots.  To help you we
need to see what you have really done, not just what you think you've
done.  That means showing some code (and the data wouldn't hurt, either).

> 
> The study estimated the ebb and flow of traffic through a voting
> place. Just one voting place was studied; the election was the
> U.S. mid-term election about a year ago. Procedure: The voting day
> was divided into five-minute bins, and the number of voters arriving
> in each bin was recorded. The voting day was 13 hours long, giving
> 156 bins.
> 
> See http://tinyurl.com/36vzop for the scatterplot. There is a rather
> high random variation, due in part to the fact that the bin width
> was intentionally set to be narrow, in order to improve the amount
> of timing information gathered.

A natural sort of model to consider first would have been a Poisson
model with a log link.  Is that what you used?  You may also need to be
a bit careful about overdispersion if you want realistic standard errors.
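For example, a minimal sketch with mgcv and made-up counts standing in
for your 156 bins (we haven't seen your data or code, and this uses
mgcv's s() smooth rather than the loess smoother you appear to have
used):

library(mgcv)

## Stand-in data: 156 five-minute bins over a 13-hour voting day
set.seed(1)
hour  <- seq(0, 13, length.out = 156)          # hours since polls opened
mu    <- exp(1.5 + 0.8 * sin(pi * hour / 13))  # some smooth trend
count <- rpois(156, mu)                        # arrivals per bin

## Poisson GAM with a log link
fit <- gam(count ~ s(hour), family = poisson(link = "log"))

## Refit as quasi-Poisson; a "Scale est." well above 1 in the summary
## suggests overdispersion, i.e. the plain Poisson standard errors are
## too optimistic.
fitq <- gam(count ~ s(hour), family = quasipoisson(link = "log"))
summary(fitq)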

> 
> http://tinyurl.com/3xjsyo displays the fitted curve. A GAM was used,
> with the loess smoothing algorithm (locally weighted
> regression). The default span was used. http://tinyurl.com/34av6l
> gives the scatterplot and the fitted curve. The two seem to match
> reasonably well.
> 

This looks pretty reasonable to me.

> However, when I tried to generate the standard errors, things went
> awry.  (Please see http://tinyurl.com/38ej2t ) There are three
> curves, seemingly the fitted curve and the curves for plus and minus
> two standard errors. The shapes seem okay, but there are large
> errors in the y values.

How did you "try to generate standard errors"?  This is where seeing
the actual code becomes important for working out what you have really
done.

This looks to me like a plot of the additive component of the model on
the log scale, with standard errors on that scale.  That would explain
why the component is on a totally different scale from the one you show
above (there you had the response scale), and in particular why it goes
negative.  It would also account for the apparent distortion in the
curve itself relative to its image on the response scale.  Components,
by construction, have mean zero.  It's the intercept that lifts them to
the right level for predictions, and the inverse link that takes them
back to the response scale.
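Continuing the sketch above (assuming an mgcv-style fit called "fit"),
the difference looks like this:

## plot.gam shows the centred smooth on the link (log) scale; it has
## mean zero and can go negative even though counts cannot.
plot(fit, se = TRUE)

## For intervals on the response scale: predict on the link scale with
## standard errors, add/subtract two SEs there, then back-transform
## with the inverse link (exp for a log link).
pr    <- predict(fit, type = "link", se.fit = TRUE)
upper <- exp(pr$fit + 2 * pr$se.fit)
lower <- exp(pr$fit - 2 * pr$se.fit)

plot(hour, count, pch = 16, col = "grey")
lines(hour, exp(pr$fit))
lines(hour, upper, lty = 2)
lines(hour, lower, lty = 2)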

> 
> Question: Have I overfitted the data?

Most likely not.  You may need to understand the model you are fitting
a bit more, though, as well as the tools.


> 
> Feedback?
> 
> Tom
> Thomas L. Jones, PhD, Computer Science
> 


