[R-sig-eco] Stepwise algorithm for GAM

Gavin Simpson gavin.simpson at ucl.ac.uk
Tue May 24 10:42:10 CEST 2011


On Tue, 2011-05-24 at 07:45 +0200, Zoltan Botta-Dukat wrote:
> Hi,
> 
> there is no automatic variable selection in the mgcv package. You should 
> remove the superfluous terms manually. You can choose them using an ML test, 
> comparing AIC values, or using the plot function.
> 
> An example:
> 
> set.seed(3)
> n<-200
> ## simulate data
> dat <- gamSim(1,n=n,scale=.15,dist="poisson")
> str(dat)
> ## spurious predictors
> dat$x4 <- runif(n, 0, 1)
> dat$x5 <- runif(n, 0, 1)
> 
> b1<-gam(y~s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5),data=dat,family=poisson) # full model
> summary(b1)  # you can choose superfluous predictors based on this output
> b2<-gam(y~s(x0)+s(x1)+s(x2)+s(x3)+s(x4),data=dat, family=poisson) # reduced model without x5
> anova(b,b2,test="Chisq") # comparing the two models

anova(b1, b2, test = "Chisq")
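
A self-contained sketch of that corrected comparison, assuming the mgcv
package is installed and reusing the simulation settings from the quoted
example. The `verbose = FALSE` argument and the `AIC()` call are additions
for illustration, not part of the original message:

```r
library(mgcv)  # provides gam() and gamSim()

set.seed(3)
n <- 200
## simulate data as in the quoted example; verbose = FALSE only silences the message
dat <- gamSim(1, n = n, scale = 0.15, dist = "poisson", verbose = FALSE)
dat$x4 <- runif(n, 0, 1)  # spurious predictor
dat$x5 <- runif(n, 0, 1)  # spurious predictor

## full model and the reduced model without x5
b1 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5),
          data = dat, family = poisson)
b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4),
          data = dat, family = poisson)

cmp <- anova(b1, b2, test = "Chisq")  # approximate test; note b1, not b
aic <- AIC(b1, b2)                    # the AIC comparison mentioned in the quoted message
print(cmp)
print(aic)
```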

> plot(b1,pages=1) # smooth function is a nearly horizontal line for superfluous predictors
> 
> Setting select=TRUE may give a clearer pattern; however, in my toy example 
> the difference is small.

Small difference in terms of the selected model perhaps, but I think the
difference between manual selection and the penalisation using `select =
TRUE` is vast. In the former you have used the training data to inform
model selection - the standard errors and p-values know nothing of this
selection and are thus biased. In the latter with `select = TRUE`, an
additional penalty term in the smoothness selection is optimised over
during model fitting. The p-values, whilst still approximate, are at
least interpretable in that case.
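
A minimal sketch of the penalised alternative, assuming mgcv is installed
and reusing the simulation from the quoted example. `select = TRUE` is an
argument of `gam()`; `method = "REML"` and `verbose = FALSE` are choices made
for this illustration and are not stated anywhere in the thread:

```r
library(mgcv)  # provides gam() and gamSim()

set.seed(3)
n <- 200
## same simulated data as the quoted example, plus two spurious predictors
dat <- gamSim(1, n = n, scale = 0.15, dist = "poisson", verbose = FALSE)
dat$x4 <- runif(n, 0, 1)
dat$x5 <- runif(n, 0, 1)

## select = TRUE adds an extra penalty to each smooth so that a whole term
## can be shrunk towards zero while the smoothness parameters are estimated
b_sel <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5),
             data = dat, family = poisson,
             method = "REML",   # assumption: REML smoothness selection
             select = TRUE)

## effective degrees of freedom and approximate p-values per smooth
s_tab <- summary(b_sel)$s.table
print(round(s_tab, 3))
```

Terms whose effective degrees of freedom end up close to zero can be read
as having been penalised out of the model within a single fit, without
refitting a sequence of reduced models.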

G

> Best wishes
> 
> Zoltan
> 
> On 2011.05.24 at 5:21, ARISTIDES LOPEZ wrote:
> > Hello all,
> >
> >
> > Just a question: I'm trying to fit my model through stepwise
> > selection. At this point (with the valuable help of Gavin and Ben) my
> > models look like this:
> >
> >
> > model 1<-gam(Young (No. ind)~s(Lat, k=6)+s(Long, k=6)+s(Deep, k=6)+s(Area
> > (km2),k=6)+as.factor (year),family=poisson,data=L. synagris)
> >
> >
> > I have 4 species * 3 groups (young, adult and total) * 5 explanatory
> > variables (Lat, Lon, Deep, Area, Year). So I'm looking for a stepwise
> > algorithm that helps me to select the best model. I tried step() in
> > the stats package, but R gives me the following error message:
> >
> >
> > "Error in glm.control(irls.reg = 0, epsilon = 1e-06, maxit = 100, trace =
> > FALSE,  : unused argument(s) (irls.reg = 0, mgcv.tol =
> > 1e-07, mgcv.half = 15,..............."
> >
> >
> > Any suggestion?
> >
> >
> > Cheers
> >
> >
> > Date: Wed, 18 May 2011 10:53:41 -0500
> > From: ARISTIDES LOPEZ<aristideslpz at gmail.com>
> > To: r-sig-ecology at r-project.org
> > Subject: [R-sig-eco] Error message in GAM
> > Message-ID:<BANLkTikz-dQ=jV9YkfTGgEYO5uBWmcUsMw at mail.gmail.com>
> > Content-Type: text/plain
> >
> > Dear members list,
> >
> > I'm trying to build a model to describe the distribution of demersal fishes
> > in the Colombian Caribbean Sea. I have a data set of n = 56, and the model is
> > like this: Density (ind/km2) ~ s(Lat) + s(Long) + s(deep). The problem is
> > that R gives me the error message *"Model has more coefficients than data"*.
> >
> > Does anybody know how I can avoid this?
> >
> > Faithfully.
> >
> > --
> > Aristides López-Peña
> >
> >
> >
> > Date: Wed, 18 May 2011 17:48:04 +0100
> > From: Gavin Simpson<gavin.simpson at ucl.ac.uk>
> > To: ARISTIDES LOPEZ<aristideslpz at gmail.com>
> > Cc: r-sig-ecology at r-project.org
> > Subject: Re: [R-sig-eco] Error message in GAM
> > Message-ID:<1305737284.25148.15.camel at prometheus.geog.ucl.ac.uk>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > Each of your smooths will be using k = 10 degrees of freedom so that is
> > 30 degrees of freedom already, which is a lot for a data set of 56
> > observations.
> >
> > Are all the data unique? i.e. you have 56 unique density values, 56
> > unique lats, 56 unique lons etc. If not, it might be that the unique
> > information in the data is not sufficient to support the complexity of
> > the smooths.
> >
> > My money would be on that you did something you haven't actually told
> > us, and have more smooths in the model than you say and they are using
> > more degrees of freedom than it appears to us.
> >
> > The easy way to try to solve the problem will be to restrict the
> > complexity of the individual smooths:
> >
> > response ~ s(Lat, k = 6) + s(Long, k = 6) + s(deep, k = 6)
> >
> > for example.
> >
> > You could probably model these data as a Poisson with an offset term for
> > the km2 covered by each sample, rather than treating these as a density.
> >
> > HTH,
> >
> > G
> >
> > --
> > %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> >   Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
> >   ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
> >   Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
> >   Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
> >   UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
> > %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 9
> > Date: Wed, 18 May 2011 17:16:10 -0500
> > From: ARISTIDES LOPEZ<aristideslpz at gmail.com>
> > To: r-sig-ecology at r-project.org, gavin.simpson at ucl.ac.uk
> > Subject: Re: [R-sig-eco] Error message in GAM
> > Message-ID:<BANLkTimUQhNjhdOX9LNNDdT60gSiWNX38w at mail.gmail.com>
> > Content-Type: text/plain
> >
> > Dear Dr. Gavin,
> >
> > Thank you very much for your help. All my data are unique (because I have 56
> > different stations). As you suggested, I restricted the
> > complexity of the individual smooths:
> >
> > response ~ s(Lat, k = 9) + s(Long, k = 9) + s(deep, k = 9)
> >
> > Problem solved.
> >
> > Now I am trying to fit another model:
> >
> >   modelo2<-gam(Density~s(year, k=6)+s(Month, k=6)+s(rainfall, k=6),
> > family=Gamma, data=at)
> >
> > The "new" problem is that R gives me the following error: *"Error in
> > smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
> >   A term has fewer unique covariate combinations than specified maximum
> > degrees of freedom"*.
> >
> > Does anybody know what this means?
> >
> > Regards.
> >
> > --
> > Aristides López-Peña
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 10
> > Date: Wed, 18 May 2011 18:28:20 -0400
> > From: Ben Bolker<bbolker at gmail.com>
> > To: r-sig-ecology at r-project.org
> > Subject: Re: [R-sig-eco] Error message in GAM
> > Message-ID:<4DD44804.1020705 at gmail.com>
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> >   It means you're pushing your data too hard: how about being
> > old-fashioned and fitting quadratic models [e.g. poly(Lat,2)] for each
> > of your predictor variables? (This of course ignores interactions, which
> > you might want to worry about in some cases -- but you probably
> > can't.)  In principle, gam() in the mgcv package (which is what I assume
> > you are using) tries to adjust the degree of complexity of your model
> > downward as appropriate, but it may be having a hard time doing so; can
> > you set k lower?  For the models that do succeed, I would suspect that
> > the effective degrees of freedom fitted are much lower than the k values
> > you are specifying, so you could afford to reduce them (see ?choose.k).
> >
> >   Remember the rule of thumb that you should not be trying to fit more
> > than *at most* N/10 parameters, where N is your number of points -- so
> > quadratic models of 3 independent predictors (= 7 parameters, intercept
> > + 2 for each predictor variable) would already be overfitting slightly.
> >
> >   cheers
> >     Ben Bolker
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 11
> > Date: Thu, 19 May 2011 07:35:39 +0100
> > From: Gavin Simpson<gavin.simpson at ucl.ac.uk>
> > To: ARISTIDES LOPEZ<aristideslpz at gmail.com>
> > Cc: r-sig-ecology at r-project.org
> > Subject: Re: [R-sig-eco] Error message in GAM
> > Message-ID:<1305786939.2773.3.camel at chrysothemis.geog.ucl.ac.uk>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > It means exactly what it says. One of the terms in the model:
> >
> >       * s(year, k = 6)
> >       * s(Month, k = 6)
> >       * s(rainfall, k = 6)
> >
> > has *fewer* than 6 unique values. Look at the outputs from
> >
> > with(at, table(year))
> > with(at, table(Month))
> > with(at, table(rainfall))
> >
> > to see which it(they) is(are).
> >
> > G
> >
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%


