[R] How to automate this model selection algorithm?
Andrew Crane-Droesch
andrewcd at gmail.com
Tue Mar 19 01:08:11 CET 2013
I've got a complicated semi-parametric model that I'm fitting with
mgcv. I start with a model based on theory. Its got lots of
interaction terms. I want to winnow it down: removing each interaction
term or un-interacted main effect one by one, checking the AIC, and
retaining the model that gives me the lowest AIC. I then want to repeat
the procedure on the retained model.
Here is a simple example:
set.seed(42)
library(mgcv)
N=220
fac = ceiling(runif(N)*2)
a = rnorm(N); b = rnorm(N); c = rnorm(N); d = runif(N); e = rnorm(N);
y = a^fac + b*e + d/(e+1)
m1 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,b,c)
+ te(a,d,by=as.factor(fac))
)
m2 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(b,c)
+ te(a,d,by=as.factor(fac))
)
m3 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,c)
+ te(a,d,by=as.factor(fac))
)
m4 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,b)
+ te(a,d,by=as.factor(fac))
)
m5 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,b,c)
+ te(d,by=as.factor(fac))
)
m6 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,b,c)
+ te(a,by=as.factor(fac))
)
m7 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,b,c)
+ te(a,d)
)
selection = AIC(m1,m2,m3,m4,m5,m6,m7)
selection
df AIC
m1 14.53435 1626.611
m2 12.52501 1622.635
m3 12.54566 1622.615
m4 12.52652 1622.633
m5 13.14108 1620.759
m6 10.99684 1621.156
m7 10.98136 1622.229
m5 is the best
m5 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,b,c)
+ te(d,by=as.factor(fac))
)
m5.1 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(b,c)
+ te(d,by=as.factor(fac))
)
m5.2 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,c)
+ te(d,by=as.factor(fac))
)
m5.3 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,b)
+ te(d,by=as.factor(fac))
)
m5.4 = gam(y~ as.factor(fac)
+ s(a)
+ s(b)
+ s(c)
+ s(d)
+ te(a,b,c)
#+ te(d,by=as.factor(fac))
)
selection.2 = AIC(m5,m5.1,m5.2,m5.3,m5.4)
selection.2
df AIC
m5 13.363029 1621.183
m5.1 9.671656 1617.641
m5.2 9.730047 1617.549
m5.3 9.706424 1617.569
m5.4 9.857504 1620.028
5.2 is the best
...and so on. I'd next try out each interaction term or un-interacted
main effect in m5.2. Question is how I can automate this? To start
with a model (m1, in my case), and have R give me the best model after
running this algorithm to the point where there are no longer any
"better" models according to this algorithm?
Right now I can do it manually, but as I add data over time, model
selection may change.
Note the mgcv's "select=TRUE" functionality doesn't quite work for me
(at least not directly), because I want to see if single parts of
three-way interactions can/should be removed.
Thanks in advance for any tips, and apologies for cross-posting (on
stack overflow).
Best,
Andrew
More information about the R-help
mailing list