[R] mgcv: how to select significant predictor vars when using gam(..., select=TRUE) with automatic optimization
Simon Wood
s.wood at bath.ac.uk
Thu Apr 18 17:25:44 CEST 2013
Jan,
Thanks for the data (off list). The p-value computations rest on the
assumption that things are approximately normal on the linear predictor
scale, but in this case they are nowhere close to normal, which is why
the p-values look inconsistent. The approximate normality assumption
fails because the model is quite a poor fit. If you take a look at
gam.check(fit) you'll see that the constant variance assumption of
quasi(link=log) is violated quite badly, and the residual distribution
is really quite odd (plot residuals against fitted values as well).
Also see plot(fit,pages=1,scale=0) - it shows ballooning confidence
intervals and smooth estimates that are so low in places that they
might as well be minus infinity (given the log link) - clearly
something is wrong with this model!
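For reference, the checks I mean are just the following (a minimal
sketch; 'fit' here is your quasi(link=log) model from the fits quoted
below):

gam.check(fit)                     # residual QQ-plot, residuals vs linear predictor, etc.
plot(fitted(fit), residuals(fit))  # residuals against fitted values
plot(fit, pages=1, scale=0)        # all smooths on one page, each with its own y-scale
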
I would be inclined to reset all the zeros back to 0 (rather than 0.01),
and then to try Tweedie(p=1.5,link=log) (or a nearby p) as the family.
The predictor variables are also very skewed, which is giving leverage
problems, so I would transform them to reduce the skew. e.g. something
like

fit <- gam(target ~ s(log(mgs)) + s(I(gsd^0.5)) + s(I(mud^0.25)) + s(log(ssCmax)),
           family = Tweedie(p=1.6, link=log), data = wspe1, method = "REML")

gives a model that is closer to being reasonable (the p-values are then
consistent between select=TRUE and select=FALSE).
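A quick way to see that consistency is to refit with select=TRUE and
compare the smooth term tables directly (a sketch, using the fit above):

fit.sel <- update(fit, select=TRUE)
summary(fit)$s.table      # select=FALSE
summary(fit.sel)$s.table  # select=TRUE; the p-values should now broadly agree
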
best,
Simon
On 18/04/13 14:24, Simon Wood wrote:
> Jan,
>
> Thanks for this. Is there any chance that you could send me the data off
> list and I'll try to figure out what is happening? (On the
> understanding that I'll only use the data to investigate this issue,
> of course).
>
> best,
> Simon
>
> On 18/04/13 11:11, Jan Holstein wrote:
>> Simon,
>>
>> thanks for the reply, I guess I'm pretty much up to date using
>> mgcv 1.7-22; upgrading to R 3.0.0 didn't change anything either.
>>
>> Unfortunately, using method="REML" does not make any difference:
>>
>> ####### First, with select=FALSE
>>> fit <- gam(target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax),
>>> family = quasi(link = log), data = wspe1,
>>> method = "REML", select = FALSE)
>>>
>>> summary(fit)
>>
>> Family: quasi
>> Link function: log
>> Formula:
>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
>> Parametric coefficients:
>>             Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   -4.724      7.462  -0.633    0.527
>>
>> Approximate significance of smooth terms:
>>             edf Ref.df      F p-value
>> s(mgs)    3.118  3.492  0.099   0.974
>> s(gsd)    6.377  7.044 15.596  <2e-16 ***
>> s(mud)    8.837  8.971 18.832  <2e-16 ***
>> s(ssCmax) 3.886  4.051  2.342   0.052 .
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> R-sq.(adj) = 0.403 Deviance explained = 40.6%
>> REML score = 33186 Scale est. = 8.7812e+05 n = 4511
>>
>> #### Then with select=TRUE
>>> fit2 <- gam(target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax),
>>> family = quasi(link = log), data = wspe1,
>>> method = "REML", select = TRUE)
>>>
>>> summary(fit2)
>> Family: quasi
>> Link function: log
>> Formula:
>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
>> Parametric coefficients:
>>             Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   -6.406      5.239  -1.223    0.222
>>
>> Approximate significance of smooth terms:
>>             edf Ref.df     F p-value
>> s(mgs)    2.844      8 25.43  <2e-16 ***
>> s(gsd)    6.071      9 14.50  <2e-16 ***
>> s(mud)    6.875      8 21.79  <2e-16 ***
>> s(ssCmax) 3.787      8 18.42  <2e-16 ***
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> R-sq.(adj) = 0.4 Deviance explained = 40.1%
>> REML score = 33203 Scale est. = 8.8359e+05 n = 4511
>>
>> I played around with other families/link functions, with no success
>> regarding the "select" behaviour.
>>
>> Well, look at the structure of my data:
>> <http://r.789695.n4.nabble.com/file/n4664586/screen-capture-1.png>
>>
>> All possible predictor variables look like this in principle, and taken
>> alone, each and every one is significant according to its p-value (but
>> they cannot all be at the same time).
>> In theory, the target variable should be a hypersurface in 11-dimensional
>> space with lots of noise, but interactions of more than 2 variables get
>> costly (let alone all 11; see the sketch below), and often enough (even
>> without interactions) the fit does not converge at minimal step size.
>> When it does converge, the results are usually no better than without
>> interactions.
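>>
>> For illustration, the kind of pairwise interaction I mean is something
>> like the following (just a sketch; pairing mgs with gsd is arbitrary,
>> any other pair would do):
>>
>> fit_int <- gam(target ~ te(mgs, gsd) + s(mud) + s(ssCmax),
>>                family = quasi(link = log), data = wspe1,
>>                method = "REML")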
>>
>> Any comment/advice on model setup is very welcome here.
>>
>> Since I don't want to try out all 2047 possible combinations of up to
>> eleven predictor variables for each target variable, I currently see no
>> other way than educated manual guessing.
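>>
>> What I was hoping select=TRUE would let me do is read the surviving
>> terms straight off the fitted model, along these lines (a sketch; the
>> 0.1 edf cutoff is an arbitrary choice of mine):
>>
>> st <- summary(fit2)$s.table
>> st[st[, "edf"] > 0.1, ]  # keep terms not shrunk to ~zero by the double penalty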
>>
>> If you know another way of doing (semi-)automated model tuning/reduction,
>> I would very much appreciate it.
>>
>> best regards,
>> Jan
--
Simon Wood, Mathematical Sciences, University of Bath BA2 7AY UK
+44 (0)1225 386603 http://people.bath.ac.uk/sw283