[R-sig-eco] multiple regression

Kingsford Jones kingsfordjones at gmail.com
Wed Feb 17 22:00:31 CET 2010


On Tue, Feb 16, 2010 at 11:15 PM, David Hewitt <dhewitt37 at gmail.com> wrote:
>> Date: Sun, 14 Feb 2010 17:15:02 -0700
>> From: Kingsford Jones <kingsfordjones at gmail.com>
>> To: Gustaf.Granath at ebc.uu.se
>> Cc: r-sig-ecology at r-project.org
>> Subject: Re: [R-sig-eco] multiple regression
>>
>> I'd like to express some thoughts about two techniques that have been
>> (at least partially) defended in this thread:  i) standardizing
>> regression coefficients and ii) stepwise variable selection.  I think
>> both techniques are dangerous because they offer the illusion of
>> providing easy solutions to longstanding problems in regression
>> modeling: comparison of variable importance using observational data,
>> and automatic selection of a useful model.
>
> I completely agree.
>
>> i)  By standardizing coefficients we're generally talking about
>> rescaling an input variable to change from the original units to the
>> sample standard deviation (or 2 sample standard deviations in the case
>> of the paper cited by Gustaf).
>
> Gelman and Hill's multilevel modeling book, referred to by Gustaf
> in another thread, provides some nice examples where rescaling input
> variables leads to easier-to-interpret coefficients. I agree with KJ
> that doing so is not a cure-all, but it can help. Mostly it just
> relates to having a "smart" idea of the effects of interest and what
> they mean. Of course, rescaling or standardizing is often not an easy
> fix, as KJ notes.
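
To make the 2-sd rescaling concrete, here is a minimal R sketch (the
helper function name is mine; the idea follows Gelman and Hill's
recommendation of dividing by two standard deviations so that
continuous inputs sit on roughly the same scale as a binary 0/1
predictor):

```r
## centre each input and divide by 2 sd (Gelman & Hill's suggestion);
## 'rescale2sd' is a hypothetical helper name, not from the book
rescale2sd <- function(x) (x - mean(x)) / (2 * sd(x))

fit.raw <- lm(Fertility ~ Agriculture + Education, data = swiss)
fit.std <- lm(Fertility ~ rescale2sd(Agriculture) + rescale2sd(Education),
              data = swiss)
coef(fit.std)  # slopes are now per 2-sd change in each input
```

If I recall correctly, the arm package's standardize() automates this
rescaling for a fitted model object.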
>
> ...SNIP...
>
>> AFAIK, the better solutions proposed for comparing the relative
>> importance of variables use measures of, e.g., SSEs or partial
>> correlations over all possible orderings of the model; but I believe
>> that in the face of multicollinearity you are still faced with
>> problems in interpreting 'importance'.  It's just a tough problem...
>
> It is indeed tough, but I don't think partial correlations/SSEs are a
> good route. What methods are you referring to in particular? I can't
> see how this would help except in the simplest linear models.

Hi David,

My aim wasn't to hold up those metrics as improved measures of
importance, but rather to mention the idea of calculating a metric over
all possible orderings of the model.  E.g., see the Grömping paper I
cited earlier in the thread, or for more seminal work:

@article{kruskal1987,
title = {Relative Importance by Averaging Over Orderings},
author = {Kruskal, William},
journal = {The American Statistician},
volume = {41},
number = {1},
jstor_formatteddate = {Feb., 1987},
pages = {6--10},
abstract = {Many ways have been suggested for explicating the
ambiguous concept of relative importance for independent variables in
a multiple regression setting. There are drawbacks to all the
explications, but a relatively acceptable one is available when the
independent variables have a relevant, known ordering: consider the
proportion of variance of the dependent variable linearly accounted
for by the first independent variable; then consider the proportion of
remaining variance linearly accounted for by the second independent
variable; and so on. When, however, the independent variables do not
have a relevant ordering, that approach fails. The primary suggestion
of this article is to rescue the idea by averaging relative importance
over all orderings of the independent variables. Variations and
extensions of the idea are described.},
year = {1987},
publisher = {American Statistical Association}
}
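
For what it's worth, this averaging-over-orderings idea is implemented
in Grömping's relaimpo package as the "lmg" metric (each predictor's
contribution to R^2, averaged over all orderings). A sketch, using the
built-in swiss data as a stand-in example:

```r
## install.packages("relaimpo")  # if not already installed
library(relaimpo)

fit <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)

## "lmg": average each predictor's sequential contribution to R^2
## over all p! orderings of the predictors; rela = TRUE normalizes
## the shares to sum to 1
calc.relimp(fit, type = "lmg", rela = TRUE)
```

Note this shares the caveat above: under strong multicollinearity the
resulting 'importance' shares remain hard to interpret.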

Kingsford


>
>> ii) I planned to try and express my thoughts on stepwise selection,
>> but this is getting longwinded, ...
>> Earlier in this thread it was pointed out that if there is a strong
>> signal in the data a stepwise procedure will find it.  Perhaps, but
>> what about when there is zero signal?  Try running the function below
>> a few times with, for example, n = 50 and p = 15 (resulting in
>> observations:predictors ratio of 50:14 -- not an unusual ratio for
>> folks to try with stepwise procedures). Notice the step procedure
>> finds signal where there is only noise (serious overfitting with all
>> outputs biased high in absolute value).  Even though step does things
>> 'right' by using AIC and observing marginality constraints (which
>> don't come into play here but if you do more overfitting by adding
>> interactions, polynomials, splines etc, then it will), anytime the n:p
>> ratio is small, you're in big trouble.
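
[The function referred to above was snipped from this excerpt. Below is
a sketch of the kind of simulation described, with pure-noise data and
an n:p ratio of 50:14; the function name and structure are my own
reconstruction, not the original code:]

```r
## fit a full model to pure noise, then let step() select by AIC;
## with p = 15 columns (1 response + 14 predictors) and n = 50,
## step() will routinely retain "significant" noise predictors
sim.step <- function(n = 50, p = 15) {
  dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))  # all iid N(0,1)
  names(dat)[1] <- "y"
  full <- lm(y ~ ., data = dat)
  step(full, trace = 0)   # backward stepwise selection by AIC
}

set.seed(1)               # for reproducibility
summary(sim.step(n = 50, p = 15))
```

Running this a few times shows the overfitting described: selected
coefficients biased away from zero and small p-values, despite the
response being independent of every predictor.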
>
> Agreed. And thus the search for one "best" (or "useful", although this
> is a much fuzzier idea) model is a bad idea. Multimodel inference and
> model averaging using some formal selection criterion have not been
> mentioned, but they protect you in such situations to some extent. Far
> better than focusing on one model from a selection method of any type.
>
> --
> Dave Hewitt
>
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>
