[R-sig-eco] gam variable selection

Gavin Simpson gavin.simpson at ucl.ac.uk
Wed Sep 28 13:04:28 CEST 2011


On Wed, 2011-09-28 at 10:44 +0100, Rebecca Ross wrote:
> Hi Marco,
> Having recently been working with gams myself I would suggest a
> procedure whereby you build your model in a forward stepwise approach
> first, having run individual gams for each of your variables and
> selecting the significant variable with the best AIC as your first
> variable, and iteratively trying out the other variables as 2nd in the
> gam, selecting the combination with the best AIC, and repeating until
> you get no further AIC improvement.

Forward selection/backward elimination should be avoided at all costs.
By doing this sort of inclusion/elimination you are introducing
selection bias into the "coefficients" for the terms in the models. You
are explicitly setting some to zero by excluding them from the model and
this biases the other "coefficients".
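
A minimal simulated sketch of one side of this problem, the inflation of
estimates that survive a p-value filter. It uses a single linear term in
lm() as a simplified stand-in for a GAM term; the effect size, sample
size and number of simulations are all assumptions chosen for
illustration.

```r
# Illustrative simulation only: true_beta, n and nsim are assumed values.
set.seed(1)
true_beta <- 0.2   # small but real effect
n <- 50            # observations per simulated data set
nsim <- 2000       # number of simulated data sets

sim_one <- function() {
  x <- rnorm(n)
  y <- true_beta * x + rnorm(n)
  # estimate and p-value for x from the coefficient table
  coef(summary(lm(y ~ x)))["x", c("Estimate", "Pr(>|t|)")]
}

res <- t(replicate(nsim, sim_one()))   # nsim x 2 matrix
kept <- res[, "Pr(>|t|)"] < 0.05       # fits where a p-value rule retains x

mean_all  <- mean(res[, "Estimate"])       # average over every fit
mean_kept <- mean(res[kept, "Estimate"])   # average over retained fits only
c(all = mean_all, kept = mean_kept)
```

The average over all fits should sit close to the true value, whereas
the average over the retained fits should be noticeably larger, because
only the over-estimates pass the filter.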

Many quantitative ecologists bemoan the continued use of step-wise
feature selection procedures by ecologists. See for example Whittingham
et al. (2006).

Marra and Wood (2011) look at a backwards elimination strategy as part
of their comparison of variable selection methods for GAMs. IIRC, it
performs badly.

> I found it advisable to always first run each gam with all smooth
> functions applied (and with number of knots restricted to avoid
> overfitting the model using the term k=4 for 4 knots e.g.
> gam(x~s(y,k=4)+s(z,k=4), family=gaussian)) then check the plots for
> each of your variables and rerun each model with linear functions
> applied as advised by the plots.

By restricting the dimension of the basis functions to such low levels,
you are making an explicit statement about the forms of model that the
GAM can fit. This is fine if you have knowledge to guide this process,
say from previous work etc. that suggests such forms for the fitted
smooths, but if not, you are forcing a very restrictive set of models
that can be fitted.
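
A rough sketch of what this restriction does in practice, assuming the
simulated data from mgcv's gamSim() (not anyone's real data). With
k = 4 a univariate smooth can use at most 3 effective degrees of
freedom once the identifiability constraint is applied, so a wiggly
relationship cannot be represented.

```r
library(mgcv)

# Simulated example data (an assumption for illustration); x2 has a wiggly effect.
set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)

# Same model twice: s(x2) restricted to k = 4 versus the default basis size.
m_k4  <- gam(y ~ s(x0) + s(x1) + s(x2, k = 4) + s(x3),
             data = dat, method = "REML")
m_k10 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
             data = dat, method = "REML")

# Effective degrees of freedom used by the s(x2) term (third smooth) in each fit.
edf_k4  <- summary(m_k4)$edf[3]
edf_k10 <- summary(m_k10)$edf[3]
c(k4 = edf_k4, default = edf_k10)

# Heuristic check of whether the basis dimensions look adequate.
k.check(m_k10)
```

If the edf of a term sits right at its upper limit, as it should for
s(x2, k = 4) here, that is a hint the basis may be too small for the
data rather than evidence that the relationship is simple.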

> Also remember to throw out significantly correlated variables once one
> of your correlates has been selected.

This is an important point - known as concurvity in additive models.
mgcv has a function, concurvity(), to compute some measures that might
indicate the presence of concurvity, but this is more involved than just
looking for correlated variables - note that linear correlation is not
much use when you are allowing for non-linear relationships between
variables.
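
A small sketch of the difference, using mgcv's concurvity() on made-up
data in which one covariate is a non-linear function of another. The
variable names and the simulated relationship are assumptions for
illustration only.

```r
library(mgcv)

# Made-up data: x2 is (almost) a non-linear function of x1; x3 is unrelated.
set.seed(3)
n  <- 300
x1 <- runif(n)
x2 <- cos(2 * pi * x1) + rnorm(n, sd = 0.1)
x3 <- runif(n)
y  <- x1^2 + x3 + rnorm(n, sd = 0.3)

m <- gam(y ~ s(x1) + s(x2) + s(x3), method = "REML")

# Linear correlation between x1 and x2 gives little hint of a problem ...
lin_cor <- cor(x1, x2)

# ... whereas the concurvity measures (0 = none, 1 = total) flag s(x2).
conc <- concurvity(m, full = TRUE)
lin_cor
round(conc, 2)
```

concurvity(m, full = FALSE) breaks the measures down by pairs of terms,
which can help identify which smooths are involved.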

> The backwards stepwise model build could then be run to check the
> forwards build and using a global model that has excluded the thrown
> out correlates.
> 
> Also worth knowing, but not worth relying on, is that there is a
> function called "dredge" which will run through your global model and
> list the potential model builds in order of best AIC. This is a
> variable selection algorithm but it does not take into account
> correlates or significance so it is best used only as advice and
> another check for a longhand build.

The idea behind dredge() (from the MuMIn package) is that one can
average the predictions from the set of best candidate models - it is
not meant as a means to find the best set of predictors for a single
model.
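
A hand-rolled sketch of that averaging idea, using Akaike weights on a
small, assumed set of candidate GAMs. It deliberately avoids dredge()
so that it only needs mgcv; the candidate set and the gamSim() data are
assumptions for illustration, not a recommendation.

```r
library(mgcv)

# Simulated example data (an assumption for illustration).
set.seed(4)
dat <- gamSim(1, n = 300, dist = "normal", scale = 2, verbose = FALSE)

# A small, hand-picked candidate set, fitted by ML so the AICs are comparable.
cands <- list(
  m1 = gam(y ~ s(x0) + s(x1) + s(x2), data = dat, method = "ML"),
  m2 = gam(y ~ s(x1) + s(x2),         data = dat, method = "ML"),
  m3 = gam(y ~ s(x1) + s(x2) + s(x3), data = dat, method = "ML")
)

# Akaike weights from AIC differences relative to the best model.
aic   <- sapply(cands, AIC)
delta <- aic - min(aic)
w     <- exp(-delta / 2) / sum(exp(-delta / 2))

# Predictions from each model (one column per model), then the weighted average.
preds    <- sapply(cands, function(m) as.numeric(predict(m, newdata = dat)))
avg_pred <- as.vector(preds %*% w)

round(w, 3)
head(avg_pred)
```

The point of the exercise is the averaged prediction, not the ranking
of models by AIC.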

The paper by Marra and Wood (2011), whilst being somewhat technical in
places, is an excellent resource for comparing the various means
available for doing feature selection in GAMs. Whilst theirs is but one
study, the general result appears to be that adding an extra penalty to
the penalised regression solved by mgcv::gam(), which allows variables
to be shrunk out of the model entirely, is a robust and powerful means
of identifying important features.

Couple this with fitting via REML or ML (not the default GCV) as GCV can
overfit and we now have very good guides as to how to perform feature
selection in, and fit, GAMs via the penalised regression approach of
Simon Wood as implemented in his mgcv package.
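
A minimal sketch of that combination, the extra penalty via
select = TRUE with REML smoothness selection, on simulated data from
gamSim() (an assumption for illustration). In this simulation x3 has no
effect on the response, so its smooth is a candidate for being shrunk
away.

```r
library(mgcv)

# Simulated example data (an assumption); x3 has no true effect on y.
set.seed(5)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)

# select = TRUE adds an extra penalty on each smooth so that a term can be
# shrunk out of the model; method = "REML" replaces the default GCV criterion.
m_sel <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
             data = dat, method = "REML", select = TRUE)

# Effective degrees of freedom per smooth; values near zero indicate a term
# that has been (almost) entirely penalised out of the model.
sm  <- summary(m_sel)
edf <- setNames(sm$edf, rownames(sm$s.table))
round(edf, 2)
```

Terms whose edf is shrunk to near zero contribute essentially nothing
to the fit, without any refitting of reduced models.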

G

Refs:

Marra G, Wood SN (2011) Practical variable selection for generalized
additive models. Computational Statistics and Data Analysis
55:2372-2387

Whittingham MJ, Stephens PA, Bradbury RB, Freckleton RP (2006) Why do we
still use step-wise modelling in ecology and behaviour? J Animal Ecol
75:1182-1189

> All the best,
> Bex
> 
> Research Assistant 
> University of Plymouth
> 
> 
> 
> 
> -----Original Message-----
> From: r-sig-ecology-bounces at r-project.org [mailto:r-sig-ecology-bounces at r-project.org] On Behalf Of r-sig-ecology-request at r-project.org
> Sent: 27 September 2011 11:00
> To: r-sig-ecology at r-project.org
> Subject: R-sig-ecology Digest, Vol 42, Issue 16
> 
> 
> Today's Topics:
> 
>    1. gam variable selection (Marco Helbich)
>    2. Re: gam variable selection (Gavin Simpson)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Tue, 27 Sep 2011 08:54:52 +0200
> From: Marco Helbich <marco.helbich at gmx.at>
> To: r-sig-ecology at r-project.org
> Subject: [R-sig-eco] gam variable selection
> Message-ID: <4E81733C.8090700 at gmx.at>
> Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> 
> Dear list,
> 
> I am studying the influence of several environmental factors (numeric &
> dummies) on species densities (= numeric) using the gam() function with a gaussian link function in the mgcv package. As stated in Wood (2006) there is no variable selection algorithm.
> 
> Is it an appropriate (iterative) approach to drop the predictor being least significant (eg. p > 0.05), refit the model, compare the GCV/AIC score and so forth. Should I first focus on the smoothing functions or fixed effects? Or is such a distinction not important at all?
> 
> Perhaps someone has more experience with GAMs and can give me a helping hand? Thanks in advance!
> 
> Best
> Marco
> --
> Marco Helbich
> Department of Geography
> University of Heidelberg
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Tue, 27 Sep 2011 10:40:27 +0100
> From: Gavin Simpson <gavin.simpson at ucl.ac.uk>
> To: Marco Helbich <marco.helbich at gmx.at>
> Cc: r-sig-ecology at r-project.org
> Subject: Re: [R-sig-eco] gam variable selection
> Message-ID: <1317116427.2714.3.camel at chrysothemis.geog.ucl.ac.uk>
> Content-Type: text/plain; charset="UTF-8"
> 
> On Tue, 2011-09-27 at 08:54 +0200, Marco Helbich wrote:
> > Dear list,
> > 
> > I am studying the influence of several environmental factors (numeric &
> > dummies) on species densities (= numeric) using the gam()
> > function with a gaussian link function in the mgcv package. As stated in
> > Wood (2006) there is no variable selection algorithm.
> > 
> > Is it an appropriate (iterative) approach to drop the predictor being
> > least significant (eg. p > 0.05), refit the model, compare the GCV/AIC
> > score and so forth. Should I first focus on the smoothing functions or 
> > fixed effects? Or is such a distinction not important at all?
> > 
> > Perhaps someone has more experience with GAMs and can give me a helping
> > hand? Thanks in advance!
> 
> You could do that, but I would be sceptical of the results.
> 
> Marra and Wood (2011, Computational Statistics and Data Analysis 55;
> 2372-2387) compare various approaches for feature selection in GAMs.
> IIRC, they concluded that an additional penalty term in the smoothness
> selection procedure gave the best results. This can be activated in
> mgcv::gam() by using the `select = TRUE` argument/setting.
> 
> HTH
> 
> G
> 
> > Best
> > Marco
> 

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%


