[R-sig-eco] multiple regression

Wed Feb 17 07:15:34 CET 2010

> Date: Sun, 14 Feb 2010 17:15:02 -0700
> From: Kingsford Jones <kingsfordjones at gmail.com>
> To: Gustaf.Granath at ebc.uu.se
> Cc: r-sig-ecology at r-project.org
> Subject: Re: [R-sig-eco] multiple regression
>
> I'd like to express some thoughts about two techniques that have been
> (at least partially) defended in this thread:  i) standardizing
> regression coefficients and ii) stepwise variable selection.  I think
> both techniques are dangerous because they offer the illusion of
> providing easy solutions to longstanding problems in regression
> modeling: comparison of variable importance using observational data,
> and automatic selection of a useful model.

I completely agree.

> i)  By standardizing coefficients we're generally talking about
> rescaling an input variable to change from the original units to the
> sample standard deviation (or 2 sample standard deviations in the case
> of the paper cited by Gustaf).

Gelman and Hill's book on multilevel modeling, which Gustaf referred
to in another thread, provides some nice examples where rescaling
input variables leads to coefficients that are easier to interpret. I
agree with KJ that doing so is not a cure-all, but it can help. Mostly
it comes down to having a "smart" idea of the effects of interest and
what they mean. Rescaling or standardizing does not supply that idea
for you, so it is often not an easy fix, as KJ notes.
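
To make the rescaling point concrete, here is a minimal Python sketch.
The numbers are made up for illustration, and the names slope(),
rescale(), temp_c and growth are hypothetical, not from the book or
from this thread. It only shows that dividing a centered input by 1 or
2 sample SDs changes the units of the slope, not the fit.

```python
import statistics

def slope(x, y):
    """Least-squares slope of y on x (an intercept is fitted implicitly)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def rescale(x, n_sd=1):
    """Center x and divide by n_sd sample standard deviations."""
    m, s = statistics.mean(x), statistics.stdev(x)
    return [(a - m) / (n_sd * s) for a in x]

# Made-up illustrative values, not data from this thread.
temp_c = [8.0, 10.5, 12.0, 14.5, 17.0, 19.5, 21.0, 23.5]
growth = [1.1, 1.9, 2.0, 2.8, 3.1, 4.2, 4.0, 5.1]

sd_temp = statistics.stdev(temp_c)
b_raw = slope(temp_c, growth)              # change in growth per degree C
b_1sd = slope(rescale(temp_c, 1), growth)  # change per 1 SD of temperature
b_2sd = slope(rescale(temp_c, 2), growth)  # change per 2 SDs of temperature

print(f"per unit: {b_raw:.3f}  per 1 SD: {b_1sd:.3f}  per 2 SD: {b_2sd:.3f}")
```

The rescaled slopes are just the raw slope multiplied by 1 or 2 SDs of
the input, so whether they are "easier to interpret" still depends on
whether an SD-sized change is a meaningful contrast for that variable.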

...SNIP...

> AFAIK, the better solutions proposed for comparing the relative
> importance of variables use measures of e.g., SSEs or partial
> correlations over all possible orderings of the model; but I believe
> that in the face of multicollinearity you are still faced w/ problems
> in interpreting 'importance'.  It's just a tough problem...

It is indeed tough, but I don't think partial correlations/SSEs are a
good route. What methods are you referring to in particular? I can't
see how these measures would help except in the simplest linear models.

> ii) I planned to try and express my thoughts on stepwise selection,
> but this is getting longwinded, ...
> Earlier in this thread it was pointed out that if there is a strong
> signal in the data a stepwise procedure will find it.  Perhaps, but
> what about when there is zero signal.  Try running the function below
> a few times with, for example, n = 50 and p = 15 (resulting in
> observations:predictors ratio of 50:14 -- not an unusual ratio for
> folks to try with stepwise procedures). Notice the step procedure
> finds signal where there is only noise (serious overfitting with all
> outputs biased high in absolute value).  Even though step does things
> 'right' by using AIC and observing marginality constraints (which
> don't come into play here but if you do more overfitting by adding
> interactions, polynomials, splines etc, then it will), anytime the n:p
> ratio is small, you're in big trouble.

Agreed. And thus the search for one "best" (or "useful", although this
is a much fuzzier idea) model is a bad idea. Multimodel inference and
model averaging using some formal selection criterion have not been
mentioned, but they protect you in such situations to some extent.
That is far better than focusing on one model from a selection method
of any type.
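
As a rough sketch of what I mean, assuming AIC is the criterion: each
candidate model gets an Akaike weight, and a quantity of interest is
averaged across models using those weights. The AIC values and
coefficient estimates below are hypothetical, and the function names
are my own.

```python
import math

def akaike_weights(aic_values):
    """Convert AIC values into weights that sum to 1."""
    best = min(aic_values)
    # Relative likelihood of each model given its AIC difference from the best.
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

def model_average(estimates, weights):
    """Weighted average of one quantity across the candidate models."""
    return sum(w * e for w, e in zip(weights, estimates))

# Hypothetical AICs for three candidate models, and each model's estimate
# of the same coefficient (0.0 where the predictor is not in the model).
aics = [100.0, 101.2, 103.5]
estimates = [0.8, 0.5, 0.0]

weights = akaike_weights(aics)
averaged = model_average(estimates, weights)

print("weights:", [round(w, 3) for w in weights])
print("model-averaged estimate:", round(averaged, 3))
```

Setting the estimate to zero in models that omit the predictor is one
convention among several; it shrinks the averaged coefficient toward
zero when support for the predictor is weak.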

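Returning to the zero-signal experiment KJ describes: his function is
not reproduced in this message, so the following is only my own
stand-in, not his code and not R's step(). It is a plain-Python greedy
forward selection by Gaussian AIC, run on pure noise with n = 50 and
14 predictors. All names here are hypothetical.

```python
import math
import random

def ols_rss(columns, y):
    """RSS from regressing y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))]
         for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def aic(rss, n, n_coef):
    """Gaussian AIC up to an additive constant."""
    return n * math.log(rss / n) + 2 * n_coef

def forward_select(xs, y):
    """Greedy forward selection by AIC; returns indices of chosen columns."""
    n = len(y)
    chosen = []
    best = aic(ols_rss([], y), n, 1)  # intercept-only model
    while len(chosen) < len(xs):
        trials = []
        for j in range(len(xs)):
            if j not in chosen:
                cols = [xs[i] for i in chosen + [j]]
                trials.append((aic(ols_rss(cols, y), n, len(cols) + 1), j))
        cand_aic, j = min(trials)
        if cand_aic >= best:  # no candidate lowers AIC, so stop
            break
        chosen.append(j)
        best = cand_aic
    return chosen

def simulate(n=50, p=15, reps=20, seed=1):
    """Count predictors kept when y and all p - 1 predictors are noise."""
    rng = random.Random(seed)
    counts = []
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(n)]
        xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p - 1)]
        counts.append(len(forward_select(xs, y)))
    return counts

counts = simulate()
print("predictors kept per replicate:", counts)
print("replicates keeping at least one:",
      sum(c > 0 for c in counts), "of", len(counts))
```

Since the response is unrelated to every predictor, the correct answer
in each replicate is zero predictors kept; anything above that is the
selection procedure finding "signal" in noise.
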
-- 
Dave Hewitt