[R] gam() (in mgcv) with multiple interactions
Ben Haller
rhelp at sticksoftware.com
Tue Jun 7 18:00:37 CEST 2011
Hi! I'm learning mgcv, and reading Simon Wood's book on GAMs, as recommended to me earlier by some folks on this list. I've run into a question to which I can't find the answer in his book, so I'm hoping somebody here knows.
My outcome variable is binary, so I'm doing a binomial fit with gam(). I have five independent variables, all continuous, all uniformly distributed in [0, 1]. (This dataset is the result of a simulation model.) Let's call them a,b,c,d,e for simplicity. I'm interested in interactions such as a*b, so I'm using tensor product smooths such as te(a,b). So far so good. But I'm also interested in, let's say, a*d. So ok, I put te(a,d) in as well. Both of these have a as a marginal basis (if I'm using the right terminology; all I mean is, both interactions involve a), and I would have expected them to share that basis; I would have expected them to be constrained such that the effect of a when b=0, for one, would be the same as the effect of a when d=0, for the other. This would be just as, in a GLM with formula a*b + a*d, that formula would expand to a + b + d + a:b + a:d, and there is only one "a"; a doesn't get to be different for the a*b interaction than it is for the a*d interaction. But with tensor product smooths in gam(), that does not seem to be the case. I'm still just getting to know mgcv and experimenting with things, so I may be doing something wrong; but the plots I have done of fits of this type appear to show different marginal effects.
I tried explicitly including terms for the marginal basis; in my example, I tried a formula like te(a) + te(b) + te(d) + te(a, b) + te(a, d). No dice; in this case, the main effect of a is different between all three places where it occurs in the model. I.e. te(a) shows a different effect of a than te(a, b) shows at b=0, which is again different from the effect shown by te(a, d) at d=0. I don't even know what that could possibly mean; it seems wrong to me that this could even be the case, but what do I know. :->
I could move up to a higher-order tensor like te(a,b,d), but there are three problems with that. One, the b:d interaction (in my simplified example) is then also part of the model, and I'm not interested in it. Two, given the set of interactions that I *am* interested in, I would actually be forced to do the full five-way te(a,b,c,d,e), and with a 300,000 row dataset, I shudder to think how long that will take to run, since it would have something like 5^5 free parameters to fit; that doesn't seem worth pursuing. And three, interpretation of a five-way interaction would be unpleasant, to say the least; I'd much rather be able to stay with just the two-way (and one three-way) interactions that I know are of interest (I know this from previous logistic regression modelling of the dataset).
For those who like to see the actual R code, here are two fits I've tried:
gam(outcome ~ te(acl, dispersal) + te(amplitude, dispersal) + te(slope, curvature, amplitude), family=binomial, data=rla, method="REML")
gam(outcome ~ te(slope) + te(curvature) + te(amplitude) + te(acl) + te(dispersal) + te(slope, curvature) + te(slope, amplitude) + te(curvature, amplitude) + te(acl, dispersal) + te(amplitude, dispersal) + te(slope, curvature, amplitude), family=binomial, data=rla, method="REML")
So. Any advice? How can I correctly do a gam() fit involving multiple interactions that involve the same independent variable?
Thanks!
Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
More information about the R-help
mailing list