# [R] mgcv: how select significant predictor vars when using gam(...select=TRUE) using automatic optimization

Jan Holstein jan.holstein at awi.de
Thu Apr 25 11:45:02 CEST 2013

```Juliet,

here are the diagnostic plots for you.

Just to recap:

fit<-gam(target
> summary(fit)

Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   -4.724      7.462  -0.633    0.527
Approximate significance of smooth terms:
edf Ref.df      F p-value
s(mgs)    3.118  3.492  0.099   0.974
s(gsd)    6.377  7.044 15.596  <2e-16 ***
s(mud)    8.837  8.971 18.832  <2e-16 ***
s(ssCmax) 3.886  4.051  2.342   0.052 .
---
R-sq.(adj) =  0.403   Deviance explained = 40.6%
REML score =  33186  Scale est. = 8.7812e+05  n = 4511

(I slightly shortened the output)

Also of interest:
Model error as root mean squared error (RMSE):

> sqrt(mean(residuals.gam(fit,type="response")^2))
 934.6647

Here are diagnostic plots:

<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-1.png>

<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-2.png>

Here is Simon's comment on this particular model from Apr 18, 2013; 5:25pm (see
above):

"The p-value computations are based on
the approximation that things are approximately normal on the linear
predictor scale, but actually they are no where close to normal in this
case, which is why the p-values look inconsistent. The reason that the
approximate normality assumption doesn't hold is that the model is quite
a poor fit. If you take a look at gam.check(fit) you'll see that the
and the residual distribution is really quite odd (plot residuals
against fitted as well). Also see plot(fit,pages=1,scale=0) - it shows
ballooning confidence intervals and smooth estimates that are so low in
places that they might as well be minus infinity (given log link) -
clearly something is wrong with this model! "

"try Tweedie(p=1.5,link=log) as the family. Also the predictor
variables are very skewed which is giving leverage problems, so I would
transform them to give less skew. e.g. Something like "

fit<-gam(target~s(log(mgs))+s(I(gsd^.5))+s(I(mud^.25))+s(log(ssCmax)),
> summary(fit)

Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.02654    0.05231   76.97   <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df     F p-value
s(log(mgs))    6.067  7.292 12.58  <2e-16 ***
s(I(gsd^0.5))  4.009  5.138 18.25  <2e-16 ***
s(I(mud^0.25)) 7.210  8.240 58.54  <2e-16 ***
s(log(ssCmax)) 8.407  8.764 74.87  <2e-16 ***
R-sq.(adj) =  0.303   Deviance explained =   51%
REML score =  14355  Scale est. = 27.702    n = 4511

(I slightly shortened the output)

RMSE did not improve:
> sqrt(mean(residuals.gam(fit,type="response")^2))
 1009.268

The diagnostic plots follow:

<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-3.png>

<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-4.png>

which look much better.
The QQ-plot is closer to the identity line,
the residuals are more evenly spread and much smaller.
Still, the correlation between response and fitted values seems pretty low.

Hope this helps,

Jan


```
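
For anyone who wants to repeat the checks from this message, the following is a minimal sketch of the workflow, not the original analysis. The data are simulated, since the thread's data set is not available: the column names `mgs`, `mud` and `target` are reused only for readability, and the distributions, coefficients and sample size are assumptions. It fits a Tweedie GAM with a log link on transformed predictors, as Simon suggested, then produces the diagnostics and the response-scale RMSE used above.

```r
library(mgcv)

set.seed(1)
n <- 500

# Simulated stand-ins for skewed predictors (assumed distributions,
# not the data discussed in the thread)
dat <- data.frame(
  mgs = rlnorm(n, meanlog = 5, sdlog = 1),
  mud = rbeta(n, 0.5, 2) * 100
)

# Assumed non-negative, right-skewed response with a log-linear mean
mu <- exp(1 + 0.5 * log(dat$mgs) - 0.3 * dat$mud^0.25)
dat$target <- rTweedie(mu, p = 1.5, phi = 2)

# Tweedie family with log link on transformed predictors;
# select = TRUE adds the extra shrinkage penalty from the thread title
fit <- gam(target ~ s(log(mgs)) + s(I(mud^0.25)),
           family = Tweedie(p = 1.5, link = "log"),
           data = dat, method = "REML", select = TRUE)

summary(fit)

# Residual diagnostics: QQ plot, residuals vs. linear predictor,
# histogram of residuals, response vs. fitted values
gam.check(fit)

# Smooth estimates with confidence bands, each panel on its own y scale
plot(fit, pages = 1, scale = 0)

# RMSE on the response scale, as computed in the message above
rmse <- sqrt(mean(residuals(fit, type = "response")^2))
rmse

# Correlation of observed and fitted values
cor(dat$target, fitted(fit))
```

Note that `gam.check()` plots deviance residuals by default, while the RMSE here uses response-scale residuals, so the two are not measured on the same scale and need not move together when the family or link changes.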