Hello dear R-help members,
I recently became interested in using biglm with leaps, and found myself
somewhat confused as to how to use the two together, in different settings.
I couldn't find any example codes for the leaps() package (except for in the
help file, and the examples there are not as rich as they could be). That
is why I turn to you in case you could share some good tips and examples of
code on how to use the leaps package (especially the regsubsets command)
The problem that drives me to ask this is: how to use the regsubsets()
command to immulate a forward model selection procedure on a regressions
problem ?
I attach below a few direction dear Thomas has already wrote to me on the
subject, and any help would be very welcomed:
*me:*
I feel I am missing a big something here, so please help me here -
Let's say we have a dataset with an X matrix of 10 variables, and all we
want to perform is forward variable selection with AIC, starting from
the minimal model that includes the intercept only, and with the maximum
model of all variable and their interaction up to the second order.
In that range, we wish to find the best model, based on forward selection.
*Thomas:*
Use biglm() to fit the model with all main effects and all second order
interactions. This model will be the maximum model for selection.
The minimum model, by default, is the model with only an intercept, so you
don't need to specify anything. If the minimum model is more complicated,
the vector force.in specifies which terms are in the minimum model (a
logical vector with TRUE for variables in the minimum model and FALSE for
variables not in the minimum model).
regsubsets() will give you the best model with one variable, the best with
two variables, and so on. The object produced by summary() of the
regsubsets() has a component $cp that gives Mallows' Cp for each of the best
models. This is equivalent to AIC, or you can compute AIC from the residual
sum of squares in the $rss component of the object.
regsubsets() doesn't actually fit the models, it just works out the residual
sum of squares. You need to take the output of regsubsets() and then fit
which ever of the best models you want coefficients for.
summary(regsubsets.object)$which is a logical matrix indicating which
variables are in each of the best models.
This may seem unnecessarily complicated, but regsubsets() was designed for
situations where you want lots of best models rather than just one, since
there are often lots of models that are about equally good. That's the
point of the
plot() method, where you can look at hundreds of best models from 30 or so
variables and see which variables are in most of the good models, and which
variables tend to occur together or separately -- for example, if you have
two related variables such as systolic blood pressure and diastolic blood
pressure do they substitute for each other or do they tend to occur in the
same model.
Thanks all (and again - thanks Thomas for all your patient answers so far)
Tal
p.s: I already sent this e-mail once, but couldn't seem to see it on the
list, so I resent it again - sorry if any of you got it twice.
----------------------------------------------
My contact information:
Tal Galili
Phone number: 972-50-3373767
FaceBook: Tal Galili
My Blogs:
www.talgalili.com
www.biostatistics.co.il
[[alternative HTML version deleted]]