[R] Where is gam?

Wed Oct 4 03:01:32 CEST 2000

Simon Wood notes:

> I don't know what other people think, but my concerns about gams 
> relate to the oddness of their structure. Often, modelling amounts to
> trying to find some unknown function linking your covariates to your
> response i.e. you want to find the f in:
> 
> E(y) = f(x_1, x_2, x_3....)
> 
> Unless there's good prior reason to do so, why would you start by letting:
> 
> f(.) = f_1(x_1) + f_2 (x_2) + f_3 (x_3) +.... ?

This is at the core of the reason why some of us think the method is
over-sold.  Neglecting interactions ab initio is always a dangerous
thing to do.  It is the issue behind the now famous dispute between
Box and Taguchi, and before that it was the reason why Fisher was so
insistent that in designed experiments factors should NOT be varied
"one at a time" but as much as possible all together in the one
experiment.  That is how block designs, factorial experiments,
confounding, fractional replication and all that malarkey came into
the world.  (This advice is still viewed with much surprise and
incredulity by people who do experiments in areas such as Physics and
Engineering, particularly.)

> The point is maybe clearest by considering polynomial models. If you don't
> know f(.) it seems reasonable to use some Taylor expansion of f(.) as a
> model... e.g.
> 
> f(.) = a + b x_1 + c x_2 + d x_3 + e x_1 x_2 + g x_1 x_3 + h x_2 x_3 +
>        p x_1^2 + q x_2^2 + r x_3^2 + ...
> 
> 
> Using gams is rather like deleting all mixed terms in this expansion
> (i.e. e x_1 x_2,  g x_1 x_3 etc). This seems like a very odd thing to do
> and gets weirder the more degrees of freedom you allow the model.

Yes, it does seem very strange to be packing lots of degrees of
freedom into the main effect terms and neglecting interactions
entirely, and it is, very often.
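
To make the point about deleted mixed terms concrete, here is a
minimal sketch in Python (not R, and not any gam implementation).
The grid and the two test surfaces are my own made-up assumptions.
On a complete grid, the best any additive model f_1(x_1) + f_2(x_2)
can do is a row effect plus a column effect, so that is what the
hypothetical helper additive_fit computes.

```python
# Illustration only: the grid and both test surfaces are invented assumptions.

def additive_fit(table):
    """Least-squares fit of (row effect + column effect) to a complete two-way table."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    return [[row_means[i] + col_means[j] - grand for j in range(n_cols)]
            for i in range(n_rows)]

def fraction_explained(table, fit):
    """1 - RSS/TSS for a fitted table of the same shape."""
    n = len(table) * len(table[0])
    grand = sum(sum(row) for row in table) / n
    total = sum((v - grand) ** 2 for row in table for v in row)
    resid = sum((v - f) ** 2
                for row, fit_row in zip(table, fit)
                for v, f in zip(row, fit_row))
    return 1.0 - resid / total

grid = [i / 10 for i in range(-10, 11)]  # symmetric grid on [-1, 1]

# A surface that is nothing but a mixed term, and one with no mixed term at all.
pure_interaction = [[x1 * x2 for x2 in grid] for x1 in grid]
additive_truth = [[x1 ** 2 + x2 ** 3 for x2 in grid] for x1 in grid]

r2_interaction = fraction_explained(pure_interaction, additive_fit(pure_interaction))
r2_additive = fraction_explained(additive_truth, additive_fit(additive_truth))
print("x1*x2 surface, fraction explained:    ", round(r2_interaction, 6))
print("x1^2 + x2^3 surface, fraction explained:", round(r2_additive, 6))
```

However many degrees of freedom the main effect terms are given,
the x1*x2 surface stays essentially unexplained on this grid, while
the surface with no mixed term is reproduced in full.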

On the other hand, this can be viewed as an injunction to the modeller
on how to choose variables that go into a gam.  It has to be said that
experimenters can, very often, pick variables for inclusion in a model
that will account for the behaviour of the system fairly well and at
the same time are known beforehand to be unlikely to interact.  For
example I recently had a case where I had to model a marine (coastal)
phenomenon in terms of spatial co-ordinates (and others, of course).
With latitude and longitude there was a very perceptible
interaction, but by tilting the axis system and choosing coordinates
that were essentially parallel to the coast and perpendicular to it,
respectively, the interaction largely disappeared, even though the
contributions to the model from both main effect terms were
perceptibly curvilinear.
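
The tilting of the axes can be sketched as follows, in Python and
with entirely invented ingredients: a 45 degree coast angle and a
response that is curved in both the along-coast and across-coast
directions.  None of this is the real marine data; response,
to_lon_lat and additive_r2 are hypothetical names for illustration.
The sketch measures how much of the surface a purely additive fit
can explain in each coordinate system.

```python
import math

ANGLE = math.radians(45)  # assumed orientation of the coast (hypothetical)

def response(lon, lat):
    """Made-up surface: curved in each tilted coordinate, with no interaction between them."""
    along = math.cos(ANGLE) * lon + math.sin(ANGLE) * lat
    across = -math.sin(ANGLE) * lon + math.cos(ANGLE) * lat
    return along ** 2 + 2.0 * across ** 2

def to_lon_lat(along, across):
    """Inverse rotation: tilted coordinates back to (lon, lat)."""
    lon = math.cos(ANGLE) * along - math.sin(ANGLE) * across
    lat = math.sin(ANGLE) * along + math.cos(ANGLE) * across
    return lon, lat

def additive_r2(table):
    """Fraction of variation explained by the best row + column (additive) fit."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    total = sum((v - grand) ** 2 for row in table for v in row)
    resid = sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return 1.0 - resid / total

grid = [i / 10 for i in range(-10, 11)]

# The same surface tabulated on a lon/lat grid and on an along/across grid.
in_lon_lat = [[response(lon, lat) for lat in grid] for lon in grid]
in_tilted = [[response(*to_lon_lat(a, c)) for c in grid] for a in grid]

r2_lon_lat = additive_r2(in_lon_lat)
r2_tilted = additive_r2(in_tilted)
print("additive fit in lon/lat:      ", round(r2_lon_lat, 4))
print("additive fit in tilted coords:", round(r2_tilted, 4))
```

In longitude and latitude the surface contains a lon*lat product
term that no additive fit can pick up, so the fraction explained
falls noticeably short of 1.  In the tilted coordinates the same
surface is a sum of two curved main effects and the additive fit is
essentially exact.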

My advice is not just to "try an additive model" without thinking
about the variables.  You really have to find out as much as you can
about them and the effect they might have on the behaviour of the
response and to choose suitable functions of them to go into an
additive model so that substantial interactions between them are
unlikely.  For an additive model to be successful this is an
absolutely essential step - you should even write down the reasons why
you think substantial interactions between the variables you have
ultimately chosen (or manufactured) are unlikely and include them in
your report of the work as a matter of simple modelling honesty, in my
view.  Then, of course, you should check this by throwing in a few
multiplicative terms, say, and just see if their addition
significantly improves the situation. (With large data sets you may
have to interpret "significantly" here rather carefully, too.  Every
term will typically reach the 5% level, but the real question is
whether the magnitude of their contribution to the outcome is
appreciable at the scale for which the model is intended.)
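
A rough sketch of that check, again in Python and with simulated
data of my own invention (sample size, coefficients and noise level
are all assumptions).  For brevity the main effects are linear
rather than smooth, and least_squares_rss is a hypothetical helper
that solves the normal equations directly.  The point is only to
compare the F statistic for the product term with the share of
variation the term actually accounts for.

```python
import random

def least_squares_rss(columns, y):
    """Residual sum of squares from regressing y on the given columns."""
    p = len(columns)
    # Augmented normal equations [X'X | X'y].
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(p)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(a[r][k]))
        a[k], a[pivot] = a[pivot], a[k]
        for r in range(p):
            if r != k:
                factor = a[r][k] / a[k][k]
                a[r] = [v - factor * w for v, w in zip(a[r], a[k])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    fitted = [sum(b * col[t] for b, col in zip(beta, columns))
              for t in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

random.seed(1)
n = 50_000  # a "large data set" (assumed size)
x1 = [random.uniform(-1, 1) for _ in range(n)]
x2 = [random.uniform(-1, 1) for _ in range(n)]
# Assumed truth: strong main effects plus a small multiplicative term.
y = [u + v + 0.1 * u * v + random.gauss(0, 1) for u, v in zip(x1, x2)]

ones = [1.0] * n
product = [u * v for u, v in zip(x1, x2)]

rss_additive = least_squares_rss([ones, x1, x2], y)
rss_with_product = least_squares_rss([ones, x1, x2, product], y)

mean_y = sum(y) / n
tss = sum((v - mean_y) ** 2 for v in y)

# F statistic for adding the single product term (1 and n - 4 degrees of freedom).
f_stat = (rss_additive - rss_with_product) / (rss_with_product / (n - 4))
# Extra share of total variation accounted for by the product term.
r2_gain = (rss_additive - rss_with_product) / tss

print("F statistic for the product term:", round(f_stat, 1))
print("gain in R-squared:               ", round(r2_gain, 5))
```

With this many observations the product term clears the 5% level
easily (the critical value for F on 1 and very many degrees of
freedom is about 3.84), yet it adds well under one percent to the
variation explained.  Whether that matters depends on the scale at
which the model is to be used.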

I have to say, too, that a lot of the attraction of gams is that they
become easy to interpret and easy to explain to onlookers.  There is a
very strong and understandable tendency for people to ask "What is
*the* effect of variable X?", as if it were always a reasonable
question.  If you have a gam you hand them a very nice graph of the
additive component and they shut up.  Never mind that if the variable
has substantial interactions with others in the model the whole
question becomes ill-posed and not susceptible to an answer in
isolation from all other variables, such as the gam supposedly
provides.

Gams can indeed be useful, but they are no magic bullet: you still
need to go through the tough business of *modelling* your situation
and really thinking about the data that you get.

Bill Venables.
-- 
Bill Venables,      Statistician,     CMIS Environmetrics Project
CSIRO Marine Labs, PO Box 120, Cleveland, Qld,  AUSTRALIA.   4163
Tel: +61 7 3826 7251           Email: Bill.Venables at cmis.csiro.au    
Fax: +61 7 3826 7304      http://www.cmis.csiro.au/bill.venables/
