[R] Model fitting with GAM and "by" term

Gavin Simpson gavin.simpson at ucl.ac.uk
Wed Jun 24 10:05:50 CEST 2009


On Tue, 2009-06-23 at 14:18 -0400, Paul Simonin wrote:
> Hello R Users,
>  I have a question regarding fitting a model with GAM{mgcv}. I have data 
> from several predictor (X) variables I wish to use to develop a model to 
> predict one Y variable. I am working with ecological data, so have data 
> collected many times (about 20) over the course of two years. Plotting 
> data independently for each date there appears to be relationships 
> between Y (fish density) and at least several X variables (temperature 
> and light). However, the actual value of X variables (e.g., temperature) 
> changes with date/season. In other words, fish distribution is likely 
> related to temperature, but available temperatures change through the 
> season. Thus, when data from all dates are combined to create a model 
> from the entire dataset, I think I need to include some type of 
> metric/variable/interaction term to account for this date relationship. 
> I have written the following code using a "by" term:

A by smooth like this will produce a varying coefficient model, where a
separate smooth is fitted to the levels of the by variable. As unique
datecodes becomes large, this may be inefficient for the purposes you
have in mind - which as per several of your recent emails appears, is to
fit an interaction.

In mgcv, interactions can be fitted using a bivariate smooth of the two
variables of interest. Without your data or information on the structure
of datecode, it is difficult to suggest exactly how to proceed (and
perhaps why you haven't received replies to your postings), but if
datecode represents the day-of-year or the month (expressed as a
numeric) in which sampling took place, then an interaction between
datecode and temperature could be achieved using a tensor product
smooth:

te(temperature, datecode, bs = c("cr","cc"))

I use a tensor product as we do not expect isotropy and the two
variables are on different scales. The bs species that temperature is a
cubic regression spline and that datecode is a cyclic cubic regression
spline so the end points join (i.e. you want smooth transition from
December to January). If your datecode variable does not got from month
1 (day-of-year 1) to month 12 (day-of-year 365(6)) then you might want
to specify the end-points so this smooth covers the full year. In your
call to gam, add argument 'knots'.

## for datecode === day-of-year one might use
knots = list(datecode = c(1, 365))
## for datecode === month, one might use
knots = list(datecode = c(1, 12))

i.e. with recent versions of mgcv (1.5-x branch) you can specify only
the start and end points of the smooth and gam fills in the knots up to
the chosen or default value of 'k' (the number of knots).

The te() smooth also needs version 1.5-5 of mgcv as there was a bug in
the code when setting up "cc" smooths in te() smooths that is fixed in
that version.

> 
> Distribution.s.temp.logwm2.deltaT<-gam(yoyras~s(temp,by=datecode)+s(logwm2,by=datecode)+s(DeltaT,by=datecode),data=AllData) 

This seems completely over-specified -  you are asking the model to
estimate separate smooths for each value of datecode for each of three
covariates. You'll need lots of data to fit such a model.

> 
>  However, I am not convinced this is the correct way to account for this 
> relationship. What do you think? Is there another way to include this in 
> my model? Maybe I should simply include date ("datecode") as another 
> term in the model?
> 
>  I also believe there may be an interaction between temperature and 
> light (logwm2), and based on what I have read the "by" method may be the 
> best way to include this. Correct?

In that case, temp, datecode and light will all be related via
interaction. te() smooths can take more than two covariates, so you
could try fitting a smooth on all three, but you are starting off with
quite a complicated model here.

I think, in general, interactions are fitted via multivariate smooths of
the variables of interest, not using by variables.

HTH

G

> 
>  Thank you for any input, tips, or advice you may be able to offer. I am 
> new to R, so especially grateful!
> 
> Thanks again,
> Paul Simonin
> (PhD student)
> 
> PS- If you would like additional information let me know. Also, if this question is inappropriate for the help list please let me know.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090624/c30eff71/attachment-0002.bin>


More information about the R-help mailing list