[R] gam() (in mgcv) with multiple interactions

Tue Jun 7 18:00:37 CEST 2011

  Hi!  I'm learning mgcv, and reading Simon Wood's book on GAMs, as recommended to me earlier by some folks on this list.  I've run into a question to which I can't find the answer in his book, so I'm hoping somebody here knows.

   My outcome variable is binary, so I'm doing a binomial fit with gam().  I have five independent variables, all continuous, all uniformly distributed in [0, 1].  (This dataset is the result of a simulation model.)  Let's call them a,b,c,d,e for simplicity.  I'm interested in interactions such as a*b, so I'm using tensor product smooths such as te(a,b).  So far so good.  But I'm also interested in, let's say, a*d.  So ok, I put te(a,d) in as well.  Both of these have a as a marginal basis (if I'm using the right terminology; all I mean is, both interactions involve a), and I would have expected them to share that basis; I would have expected them to be constrained such that the effect of a when b=0, for one, would be the same as the effect of a when d=0, for the other.  This would be just as, in a GLM with formula a*b + a*d, that formula would expand to a + b + d + a:b + a:d, and there is only one "a"; a doesn't get to be different for the a*b interaction than it is for the a*d interaction.  But with tensor product smooths in gam(), that does not seem to be the case.  I'm still just getting to know mgcv and experimenting with things, so I may be doing something wrong; but the plots I have done of fits of this type appear to show different marginal effects.

  I tried explicitly including terms for the marginal basis; in my example, I tried a formula like te(a) + te(b) + te(d) + te(a, b) + te(a, d).  No dice; in this case, the main effect of a is different between all three places where it occurs in the model.  I.e. te(a) shows a different effect of a than te(a, b) shows at b=0, which is again different from the effect shown by te(a, d) at d=0.  I don't even know what that could possibly mean; it seems wrong to me that this could even be the case, but what do I know.  :->

   I could move up to a higher-order tensor like te(a,b,d), but there are three problems with that.  One, the b:d interaction (in my simplified example) is then also part of the model, and I'm not interested in it.  Two, given the set of interactions that I *am* interested in, I would actually be forced to do the full five-way te(a,b,c,d,e), and with a 300,000 row dataset, I shudder to think how long that will take to run, since it would have something like 5^5 free parameters to fit; that doesn't seem worth pursuing.  And three, interpretation of a five-way interaction would be unpleasant, to say the least; I'd much rather be able to stay with just the two-way (and one three-way) interactions that I know are of interest (I know this from previous logistic regression modelling of the dataset).

  For those who like to see the actual R code, here are two fits I've tried:

gam(outcome ~ te(acl, dispersal) + te(amplitude, dispersal) + te(slope, curvature, amplitude), family=binomial, data=rla, method="REML")

gam(outcome ~ te(slope) + te(curvature) + te(amplitude) + te(acl) + te(dispersal) + te(slope, curvature) + te(slope, amplitude) + te(curvature, amplitude) + te(acl, dispersal) + te(amplitude, dispersal) + te(slope, curvature, amplitude), family=binomial, data=rla, method="REML")

  So.  Any advice?  How can I correctly do a gam() fit involving multiple interactions that involve the same independent variable?

  Thanks!

Ben Haller
McGill University

http://biology.mcgill.ca/grad/ben/