Hi L.Y,

Thank you for your advice.

Are you talking about Trevor Hastie's gam()?

I did not see anything in the output indicating that it performs automatic
cross-validation.

I also could not verify that the gam() function will automatically find the
degrees of freedom if I don't specify df and just use terms such as

s(col1) + s(col2) ...
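
For comparison, here is a minimal sketch of what automatic df selection looks
like in base R, on simulated data (x, y and the noise level are made up for
illustration, they are not the data from this thread). smooth.spline() chooses
the smoothing level by generalized cross-validation when neither df nor lambda
is given, by leave-one-out CV when cv = TRUE, and reports the resulting
equivalent df. This shows nothing about what gam() does internally; the s()
term in the gam package appears to default to df = 4 when df is omitted,
which is worth confirming in ?s.

```r
# Simulated data: assumption for illustration only
set.seed(1)
x <- seq(0, 1, length.out = 110)
y <- sin(2 * pi * x) + rnorm(110, sd = 0.3)

# No df or lambda given: smoothing level chosen by GCV (cv = FALSE default)
fit_gcv <- smooth.spline(x, y)

# Same, but chosen by ordinary leave-one-out cross-validation
fit_cv <- smooth.spline(x, y, cv = TRUE)

# df fixed by the user: no data-driven selection happens
fit_df3 <- smooth.spline(x, y, df = 3)

# Equivalent degrees of freedom of each fit
print(c(gcv = fit_gcv$df, loocv = fit_cv$df, fixed = fit_df3$df))
```

If the df printed for a fitted gam term always equals the default or the value
passed to s(), no selection took place for that term.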

Does the "step()" function also run gam() with CV and automatic tuning of
df?
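
As a point of reference, base R's stats::step() compares candidate models by
AIC and does no cross-validation. The sketch below uses lm() on simulated data
as a stand-in for gam() (the data frame d and its columns are assumptions made
up for the example). It shows that the model step() returns can have a larger
residual deviance than the starting model while having a lower AIC, since
dropping terms can never reduce the residual sum of squares.

```r
# Simulated data: only x1 truly affects y (assumption for illustration)
set.seed(42)
n <- 110
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

# Start from the full model and let step() drop terms by AIC
full <- lm(y ~ x1 + x2 + x3, data = d)
sel <- step(full, trace = 0)

# Residual deviance (RSS for lm): the selected model is never lower
dev <- c(full = deviance(full), selected = deviance(sel))
print(dev)

# AIC: the selected model is never higher
aic <- c(full = AIC(full), selected = AIC(sel))
print(aic)
```

Whether step() applied to a gam object behaves the same way depends on the
package's own method (step.gam in the gam package takes a scope argument), so
this only illustrates the AIC criterion itself.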

I wonder whether I called "step()" correctly, because it ran for only a very
short time (about 1 second) and immediately returned two models, which in
fact have an even larger residual deviance than the model I initially
provided to it... (obviously I've included every possibility in the initial
model and rely on the step() function to drop some terms for me...)
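
One thing that can be checked about an initial model of this kind, as a small
base-R sketch (the variable names y and x are placeholders): inside an R
formula, ^ is a crossing operator rather than arithmetic, so a term like
x ^ 2 collapses to x. Polynomial terms need to be wrapped in I().

```r
# ^ in a formula means "crossing to that order", not a power
f_caret <- y ~ x + x^2 + x^3
labels_caret <- attr(terms(f_caret), "term.labels")
print(labels_caret)

# I() protects the arithmetic, giving three distinct terms
f_protected <- y ~ x + I(x^2) + I(x^3)
labels_protected <- attr(terms(f_protected), "term.labels")
print(labels_protected)
```

The first formula yields a single term, so the squared and cubed columns never
enter the model; the second yields three.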

Thanks a lot!


On 3/16/06, Dr L. Y Hin <lyhin@netvigator.com> wrote:
>
> The engine of gam() lies in a function called smooth.spline(), which is
> found in the library splines. If you leave out the degrees of freedom in
> the formula specification, it will automatically choose them for you via
> cross-validation. The model-fit results obtainable via summary(mygam)
> will show you the "degree of freedom as chosen by the cross-validation
> method".
>
> On a more philosophical plane, Buja et al. (Ann Stat. 1989;17(2):453-510)
> pointed out that linear smoothers such as cubic splines and smoothing
> splines are linear because they are x-dependent and not y-dependent. By
> using cross-validation, you will invariably involve y, which renders the
> determination of the degrees of freedom y-dependent, and hence the
> smoothing parameter \lambda y-dependent. In that case the smoothing
> matrix is, strictly speaking, non-linear, because
> S = (I + \lambda K)^{-1} in the unweighted form with unique x-points.
>
> If you increase the degrees of freedom, \lambda decreases, to the point
> where you effectively have a straightforward interpolation of the points
> on the graph. Conversely, if \lambda is increased, the smooth reduces to
> a linear regression line through the points.
>
> In my opinion, AIC and residual sum of squares (RSS) are competing tools
> looking for the best fit. The minimum of AIC and that of RSS may not
> concur. If you believe in AIC, then I would assume you also believe that
> it is a better tool than RSS in that the former uses an
> information-theoretic approach, which is not sensitive to offset in
> accuracy due to penalization of outliers. Following that, I would
> disregard RSS and go with what AIC tells me.
>
> I don't think you have used step.gam incorrectly, but I think you have
> been observant enough to realize that not all statistical tools agree
> all the time :)
>
> Lin
>
> ----- Original Message -----
> From: "Michael" <comtech.usa@gmail.com>
> To: <R-help@stat.math.ethz.ch>
> Sent: Thursday, March 16, 2006 5:30 PM
> Subject: [R] Did I use "step" function correctly? (Is R's step()
> function reliable?)
>
>
> > Hi all,
> >
> > I put up an exhaustive model to use R's "step" function:
> >
> > ------------------------
> >
> > mygam=gam(col1 ~ 1
> > + col2     + col3     + col4
> > + col2 ^ 2 + col3 ^ 2 + col4 ^ 2
> > + col2 ^ 3 + col3 ^ 3 + col4 ^ 3
> > + s(col2, 1) + s(col3, 1) + s(col4, 1)
> > + s(col2, 2) + s(col3, 2) + s(col4, 2)
> > + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > + s(col2, 4) + s(col3, 4) + s(col4, 4)
> > + s(col2, 5) + s(col3, 5) + s(col4, 5)
> > + s(col2, 6) + s(col3, 6) + s(col4, 6)
> > + s(col2, 7) + s(col3, 7) + s(col4, 7)
> > + s(col2, 8) + s(col3, 8) + s(col4, 8)
> > + s(col2, 9) + s(col3, 9) + s(col4, 9),
> > data=X);
> >
> > mystep=step(mygam);
> >
> > ---------------------
> > After a long list, the following are the two models with the lowest AIC:
> >
> > Step:  AIC= 152.1
> > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)
> >
> >
> > Step:  AIC= 153.45
> > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3)
> > -----------------------------------------------
> >
> > However, the lowest AIC model, "col1 ~ col2 + col3 + col4 + s(col2, 3) +
> > s(col3, 3) + s(col4, 3)", does not give the best Residual Deviance.
> >
> > Instead, the model "mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) +
> > s(col4, 6), data=X)" is the best, in fact,
> >
> > I found that as I increase the "degree-of-freedom", it always gives a
> > better residual deviance, lower than that of the "best" model returned
> > by the "step" function... Please see below.
> >
> > I am wondering if I need to increase "degree-of-freedom" all the way
> > up... Perhaps to avoid overfitting, I should do a cross validation. Is
> > there an automatic Cross Validation inside "step" or "gam"?
> >
> > Is the result of the "step" function reliable? Or perhaps I used it
> > incorrectly?
> >
> > Thanks a lot,
> >
> > Michael.
> >
> > --------------------------
> >
> >>
> >> mygam1=gam(col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3), data=X);
> >>
> >> mygam2=gam(col1 ~ col2 + col3 + col4 , data=X);
> >>
> >> mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6), data=X);
> >>
> >> mygam1
> > Call:
> > gam(formula = col1 ~ col2 + col3 + col4 +
> >    s(col2, 3) + s(col3, 3) + s(col4, 3), data = X)
> >
> > Degrees of Freedom: 110 total; 100.9999 Residual
> > Residual Deviance: 20.98365
> >> mygam2
> > Call:
> > gam(formula = col1 ~ col2 + col3 + col4, data = X)
> >
> > Degrees of Freedom: 110 total; 107 Residual
> > Residual Deviance: 27.84808
> >> mygam3
> > Call:
> > gam(formula = col1 ~ s(col2, 6) + s(col3, 6) +
> >    s(col4, 6), data = X)
> >
> > Degrees of Freedom: 110 total; 91.99957 Residual
> > Residual Deviance: 18.45776
> >>
> >> anova(mygam1, mygam2, mygam3);
> > Analysis of Deviance Table
> >
> > Model 1: col1 ~ col2 + col3 + col4 + s(col2,
> >    3) + s(col3, 3) + s(col4, 3)
> > Model 2: col1 ~ col2 + col3 + col4
> > Model 3: col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6)
> >  Resid. Df Resid. Dev       Df Deviance P(>|Chi|)
> > 1  100.9999    20.9836
> > 2  107.0000    27.8481  -6.0001  -6.8644 6.115e-06
> > 3   91.9996    18.4578  15.0004   9.3903 3.958e-05