[R] mgcv: how select significant predictor vars when using gam(...select=TRUE) using automatic optimization
Jan Holstein
jan.holstein at awi.de
Thu Apr 25 11:45:02 CEST 2013
Juliet,
here are the diagnostic plots for you.
To recap, the first model was:
fit<-gam(target~s(mgs)+s(gsd)+s(mud)+s(ssCmax),
         family=quasi(link=log),data=wspe1,method="REML",select=FALSE)
> summary(fit)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -4.724      7.462  -0.633    0.527
Approximate significance of smooth terms:
            edf Ref.df      F p-value
s(mgs)    3.118  3.492  0.099   0.974
s(gsd)    6.377  7.044 15.596  <2e-16 ***
s(mud)    8.837  8.971 18.832  <2e-16 ***
s(ssCmax) 3.886  4.051  2.342   0.052 .
---
R-sq.(adj) = 0.403   Deviance explained = 40.6%
REML score = 33186   Scale est. = 8.7812e+05   n = 4511
(I slightly shortened the output)
Also of interest:
Model error as root mean squared error (RMSE):
> sqrt(mean(residuals.gam(fit,type="response")^2))
[1] 934.6647
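For anyone reproducing this number: residuals of type "response" are observed minus fitted values on the response scale, so the same RMSE can also be computed from fitted(). A minimal sketch on simulated data follows; the data frame `demo`, the variables `x` and `y`, and the Poisson family are stand-ins chosen only for illustration, not the wspe1 data or the model above.

```r
library(mgcv)

set.seed(42)

# Simulated stand-in data (hypothetical; not wspe1)
n      <- 300
demo   <- data.frame(x = runif(n))
demo$y <- rpois(n, lambda = exp(1 + sin(2 * pi * demo$x)))

# Small log-link GAM, fitted by REML as in the post
fit_demo <- gam(y ~ s(x), family = poisson(link = "log"),
                data = demo, method = "REML")

# RMSE on the response scale, computed as in the post
rmse_resid <- sqrt(mean(residuals(fit_demo, type = "response")^2))

# The same quantity from observed minus fitted values
rmse_fitted <- sqrt(mean((demo$y - fitted(fit_demo))^2))

print(c(rmse_resid = rmse_resid, rmse_fitted = rmse_fitted))
```

Both entries of the printed vector should agree. Note that this RMSE is in the units of the response, so it is dominated by the largest observations.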
Here are diagnostic plots:
<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-1.png>
<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-2.png>
Here is Simon's comment on this particular model, from Apr 18, 2013, 5:25pm (see
above):
"The p-value computations are based on
the approximation that things are approximately normal on the linear
predictor scale, but actually they are no where close to normal in this
case, which is why the p-values look inconsistent. The reason that the
approximate normality assumption doesn't hold is that the model is quite
a poor fit. If you take a look at gam.check(fit) you'll see that the
constant variance assumption of quasi(link=log) is violated quite badly,
and the residual distribution is really quite odd (plot residuals
against fitted as well). Also see plot(fit,pages=1,scale=0) - it shows
ballooning confidence intervals and smooth estimates that are so low in
places that they might as well be minus infinity (given log link) -
clearly something is wrong with this model! "
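The three checks Simon lists can be run as sketched below. This is only an illustration on simulated data: `demo`, `x1`, `x2` and `y` are hypothetical stand-ins for wspe1 and its variables, and the Poisson family is used purely to have something to fit. With the real model, the same calls would be applied to `fit`.

```r
library(mgcv)

set.seed(1)

# Simulated stand-in data (hypothetical; not wspe1)
n      <- 300
demo   <- data.frame(x1 = runif(n), x2 = runif(n))
demo$y <- rpois(n, lambda = exp(1 + sin(2 * pi * demo$x1) + demo$x2))

fit_demo <- gam(y ~ s(x1) + s(x2), family = poisson(link = "log"),
                data = demo, method = "REML")

# Discard plots in non-interactive runs; drop this line (and dev.off()
# below) in an interactive session to see the plots on screen.
pdf(NULL)

# 1. Standard residual diagnostics (QQ-plot, residuals vs linear
#    predictor, histogram, response vs fitted) plus printed
#    convergence and basis-dimension information
gam.check(fit_demo)

# 2. Residuals against fitted values
plot(fitted(fit_demo), residuals(fit_demo, type = "deviance"),
     xlab = "Fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)

# 3. All smooths on one page, each with its own y-axis scale, so that
#    very wide confidence bands are easy to spot
plot(fit_demo, pages = 1, scale = 0)

dev.off()
```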
Following Simon's advice (quote):
"try Tweedie(p=1.5,link=log) as the family. Also the predictor
variables are very skewed which is giving leverage problems, so I would
transform them to give less skew. e.g. Something like "
fit<-gam(target~s(log(mgs))+s(I(gsd^.5))+s(I(mud^.25))+s(log(ssCmax)),
         family=Tweedie(p=1.6,link=log),data=wspe1,method="REML")
> summary(fit)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.02654    0.05231   76.97   <2e-16 ***
Approximate significance of smooth terms:
                 edf Ref.df     F p-value
s(log(mgs))    6.067  7.292 12.58  <2e-16 ***
s(I(gsd^0.5))  4.009  5.138 18.25  <2e-16 ***
s(I(mud^0.25)) 7.210  8.240 58.54  <2e-16 ***
s(log(ssCmax)) 8.407  8.764 74.87  <2e-16 ***
R-sq.(adj) = 0.303   Deviance explained = 51%
REML score = 14355   Scale est. = 27.702   n = 4511
(I slightly shortened the output)
RMSE did not improve:
> sqrt(mean(residuals.gam(fit,type="response")^2))
[1] 1009.268
The diagnostic plots for this model follow:
<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-3.png>
<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-4.png>
which look much better.
The QQ-plot is closer to the identity line, and
the residuals are more evenly spread and much smaller.
Still, the correlation between response and fitted values seems pretty low.
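One way to put a number on that last impression is the correlation between observed and fitted values on the response scale. The sketch below uses simulated Tweedie data generated with mgcv's rTweedie(); `demo`, `x`, `y` and the parameter values (p = 1.6, phi = 2) are assumptions for illustration only and say nothing about wspe1.

```r
library(mgcv)

set.seed(7)

# Simulated stand-in data (hypothetical; not wspe1)
n      <- 500
demo   <- data.frame(x = runif(n))
mu     <- exp(1 + sin(2 * pi * demo$x))
demo$y <- rTweedie(mu, p = 1.6, phi = 2)

# Tweedie GAM with log link, fitted by REML as in the post
fit_demo <- gam(y ~ s(x), family = Tweedie(p = 1.6, link = log),
                data = demo, method = "REML")

# Pearson correlation between observed response and fitted means
r_obs_fit <- cor(demo$y, fitted(fit_demo))
print(r_obs_fit)
```

For the model above the analogous call would be cor(wspe1$target, fitted(fit)), assuming no rows were dropped for missing values during fitting.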
Hope this helps,
Jan