[R] Did I use "step" function correctly? (Is R's step() functionreliable?)

Thu Mar 16 18:14:23 CET 2006

The questions you ask lead to far more complex issues than you are aware of.
As a result, there are no "good" -- nor certainly any simple -- answers to
them. The deeper issue is how to choose an appropriate model, balancing
complexity (overfitting) with parsimony (predictive/scientific validity). A
few miscellaneous comments are:

1) Increasing model complexity (more parameters, more df for the model) for
nested models must **always** improve (or cannot harm, anyway) the fit; this
is just a mathematical identity, and so your comment to this effect is
basically meaningless.

2) The hard question is: does increased complexity improve the fit
**enough** to believe that it is "meaningful"?

3) AIC, BIC, cross-validation, extra sums of squares principles, anovas via
likelihood ratio tests, etc. etc. are all addressed at this question. All
are sometimes useful and sometimes produce junk.

4) There are now many statisticians and statistical learners who believe
that the the basic question -- what is the right model? -- is fundamentally
misleading. Their view is (approximately, I'm no expert) that there are
always several (many?) different models that are essentially equally good.
So they cast the issue as one of prediction and eschew a single model
altogether. Instead, they have developed various approaches to creating
"ensembles" of models that are basically just prediction engines largely
incapable of interpretation. Random forests, boosting, and bagging are some
of the buzzwords here, and R has packages for all.

Brian Ripley has at least one excellent presentation on these issues on his
website (his presentation on John Nelder's 80th birthday. Unfortunately,
while I have the paper, I no longer have the link. Perhaps he might re-post
it). You might also wish to have a look at some of Leo Breiman's writings on
these issues.

I repeat: I am not an expert on these matters, and my brief comments above
no doubt already contain distortions and inaccuracies. I would greatly
appreciate it if those with real expertise would correct any of the more
egregious errors.

Finally, if I may hazard a personal opinion regarding use of stepAIC or
other stepwise fitting methods: Beware! -- there lie dragons! They can be an
excellent way to generate complete spurious models.

Cheers,
Bert

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Michael
> Sent: Thursday, March 16, 2006 1:31 AM
> To: R-help at stat.math.ethz.ch
> Subject: [R] Did I use "step" function correctly? (Is R's 
> step() functionreliable?)
> 
> Hi all,
> 
> I put up an exhaustive model to use R's "step" function:
> 
> ------------------------
> 
> mygam=gam(col1 ~ 1
> + col2     + col3     + col4
> + col2 ^ 2 + col3 ^ 2 + col4 ^ 2
> + col2 ^ 3 + col3 ^ 3 + col4 ^ 3
> + s(col2, 1) + s(col3, 1) + s(col4, 1)
> + s(col2, 2) + s(col3, 2) + s(col4, 2)
> + s(col2, 3) + s(col3, 3) + s(col4, 3)
> + s(col2, 4) + s(col3, 4) + s(col4, 4)
> + s(col2, 5) + s(col3, 5) + s(col4, 5)
> + s(col2, 6) + s(col3, 6) + s(col4, 6)
> + s(col2, 7) + s(col3, 7) + s(col4, 7)
> + s(col2, 8) + s(col3, 8) + s(col4, 8)
> + s(col2, 9) + s(col3, 9) + s(col4, 9),
> data=X);
> 
>  mystep=step(mygam);
> 
> ---------------------
> After a long list, the following are two lowest AIC:
> 
> Step:  AIC= 152.1
>  col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)
> 
> 
> Step:  AIC= 153.45
>  col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3)
> -----------------------------------------------
> 
> However, the lowest AIC model,  " col1 ~ col2 + col3 + col4 + 
> s(col2, 3) +
> s(col3, 3) + s(col4, 3)" does not give the best Residual Deviance.
> 
> Instead, the model "mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) 
> + s(col4, 6),
> data=X)" is the best, in fact,
> 
> I found that as I increase the "degree-of-freedom", it always 
> give better
> residual deviance, lower than that of the "best" model 
> returned by "step"
> function... Please see below.
> 
> I am wondering if I need to increase "degree-of-freedom" all 
> the way up...
> Perhaps to avoid overfitting, I should do a cross validation. 
> Is there an
> automatic Cross Validation inside "step" or "gam"?
> 
> Is "step" function result reliable? Or perhaps I used it incorrectly?
> 
> Thanks a lot,
> 
> Michael.
> 
> --------------------------
> 
> >
> > mygam1=gam(col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 
> 3) + s(col4,
> 3), data=X);
> >
> > mygam2=gam(col1 ~ col2 + col3 + col4 , data=X);
> >
> > mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6), data=X);
> >
> > mygam1
> Call:
> gam(formula = col1 ~ col2 + col3 + col4 +
>     s(col2, 3) + s(col3, 3) + s(col4, 3), data = X)
> 
> Degrees of Freedom: 110 total; 100.9999 Residual
> Residual Deviance: 20.98365
> > mygam2
> Call:
> gam(formula = col1 ~ col2 + col3 + col4, data = X)
> 
> Degrees of Freedom: 110 total; 107 Residual
> Residual Deviance: 27.84808
> > mygam3
> Call:
> gam(formula = col1 ~ s(col2, 6) + s(col3, 6) +
>     s(col4, 6), data = X)
> 
> Degrees of Freedom: 110 total; 91.99957 Residual
> Residual Deviance: 18.45776
> >
> > anova(mygam1, mygam2, mygam3);
> Analysis of Deviance Table
> 
> Model 1: col1 ~ col2 + col3 + col4 + s(col2,
>     3) + s(col3, 3) + s(col4, 3)
> Model 2: col1 ~ col2 + col3 + col4
> Model 3: col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6)
>   Resid. Df Resid. Dev       Df Deviance P(>|Chi|)
> 1  100.9999    20.9836
> 2  107.0000    27.8481  -6.0001  -6.8644 6.115e-06
> 3   91.9996    18.4578  15.0004   9.3903 3.958e-05
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
>