[R-sig-eco] multiple regression

Kingsford Jones kingsfordjones at gmail.com
Mon Feb 15 01:15:02 CET 2010


I'd like to express some thoughts about two techniques that have been
(at least partially) defended in this thread: i) standardizing
regression coefficients and ii) stepwise variable selection.  I think
both techniques are dangerous because they offer the illusion of
providing easy solutions to longstanding problems in regression
modeling: comparing variable importance using observational data, and
automatically selecting a useful model.

i)  By standardizing coefficients we are generally talking about
rescaling an input variable so that its units change from the original
units to the sample standard deviation (or 2 sample standard
deviations, in the case of the paper cited by Gustaf).  For example,
one might scale the input variable age in years by dividing its values
by the sample standard deviation of age, so that the unit is no longer
1 year but however many years the standard deviation happened to be
(e.g. 17.667 yrs)*.  The common misconception is that by doing this
for all inputs we can then compare the absolute sizes of the resulting
regression coefficients to determine the 'relative importance' of
those variables.  For example, if another input were weight in pounds,
scaled to its observed standard deviation (say 3.135 lbs), and
\hat{\beta}_age = 1 while \hat{\beta}_weight = 2, we could then
conclude that weight is twice as important as age (given the other
predictors in the model).  Unfortunately it's not that easy.  What is
the basis for believing that a 2 unit change in the expected response
per 3.135 lbs of weight is more important than a 1 unit change in the
expected response per 17.667 years of age?  Some might argue that
comparing SDs is in itself interesting, but remember we make no
distributional assumptions about the predictor variables, so the SD
units have no natural interpretation relative to the population (i.e.
our observed SD may well be a poor estimate of the population SD), and
even if we did know the underlying pdf/pmf from which our observed
values came, there is no reason to believe that a SD is an informative
unit (the distribution could be skewed, or heck, even Cauchy :-))
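To make the mechanics concrete, here is a minimal self-contained sketch
(simulated data, with made-up coefficients) showing that dividing an
input by its sample SD simply multiplies its regression coefficient by
that SD -- the fit itself is unchanged, so nothing new is learned:

```r
set.seed(1)
n      <- 100
age    <- rnorm(n, mean = 40, sd = 17)    # years
weight <- rnorm(n, mean = 150, sd = 30)   # pounds
y      <- 0.1 * age + 0.05 * weight + rnorm(n)

fit.raw    <- lm(y ~ age + weight)
fit.scaled <- lm(y ~ I(age / sd(age)) + I(weight / sd(weight)))

# identical fitted values; the 'standardized' coefficient is just
# the raw coefficient times the sample SD of that input
all.equal(fitted(fit.raw), fitted(fit.scaled))
coef(fit.raw)["age"] * sd(age)   # equals coef(fit.scaled)[2]
```

So the 'standardized' coefficient inherits whatever arbitrariness is in
the sample SD, which is the point made above.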

OK, now suppose an ideal situation in which all input data are
representative samples from Normal distributions; would scaling the
data to unit variance allow comparison of regression coefficients?  In
an experimental setting with orthogonal predictors, yes -- otherwise
no.  E.g., observational data on the example variables given above
(age and weight) will be correlated, and we know what troubles
multicollinearity causes when it comes to interpreting regression
coefficients.  Standardizing the inputs will not help.
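Here is a quick simulated illustration (hypothetical numbers) of why
correlation breaks the comparison: the 'standardized' coefficient of
the same variable changes substantially depending on what else is in
the model, so its absolute size is not a clean measure of importance:

```r
set.seed(2)
n      <- 200
age    <- rnorm(n)
weight <- 0.8 * age + rnorm(n, sd = 0.6)   # correlated with age
y      <- age + weight + rnorm(n)

z <- function(x) x / sd(x)                 # scale to unit SD

b.marginal <- coef(lm(y ~ z(age)))[2]             # age alone
b.adjusted <- coef(lm(y ~ z(age) + z(weight)))[2] # age given weight
c(b.marginal, b.adjusted)   # same variable, very different 'importance'
```

With orthogonal predictors the two numbers would agree (up to noise);
with correlated predictors they do not.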

AFAIK, the better solutions proposed for comparing the relative
importance of variables use measures such as SSEs or partial
correlations averaged over all possible orderings of the predictors;
but I believe that in the face of multicollinearity you are still
faced with problems in interpreting 'importance'.  It's just a tough
problem...
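For the curious, the ordering-averaged idea (e.g. the LMG measure
implemented in the relaimpo package) can be sketched by hand in the
two-predictor case -- average each predictor's sequential R^2
contribution over both orders of entry (simulated data below):

```r
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)        # correlated predictors
y  <- x1 + 0.5 * x2 + rnorm(n)

r2 <- function(f) summary(lm(f))$r.squared

# average x1's R^2 contribution over both entry orders
lmg.x1 <- mean(c(r2(y ~ x1),                     # x1 enters first
                 r2(y ~ x1 + x2) - r2(y ~ x2)))  # x1 enters last
lmg.x2 <- mean(c(r2(y ~ x2),
                 r2(y ~ x1 + x2) - r2(y ~ x1)))
c(lmg.x1, lmg.x2)   # shares that sum to the full-model R^2
```

The shares decompose the full-model R^2 exactly, but as noted above,
with collinear predictors their interpretation as 'importance' is
still debatable.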


*Note that this scaling is usually accompanied by mean-centering,
which is a useful technique in regression modeling, but that's another
issue...


ii) I had planned to try to express my thoughts on stepwise selection,
but this is getting long-winded, so I'll just paste a function below
that might be of interest.  The function explores (an aspect of)
overfitting, but does not address the interpretation issues that come
with any automated model-building process.

Earlier in this thread it was pointed out that if there is a strong
signal in the data a stepwise procedure will find it.  Perhaps, but
what about when there is zero signal?  Try running the function below
a few times with, for example, n = 50 and p = 15 (p = 15 columns, one
of which is the response, giving an observations:predictors ratio of
50:14 -- not an unusual ratio for folks to try with stepwise
procedures).  Notice that the step procedure finds signal where there
is only noise (serious overfitting, with all outputs biased high in
absolute value).  Even though step does things 'right' by using AIC
and observing marginality constraints (which don't come into play here,
but will if you overfit further by adding interactions, polynomials,
splines, etc.), any time the n:p ratio is small you're in big trouble.
Note that in 'real life', even when the n:p ratio is large, we often
end up overfitting because the observed data become very sparse in
high-dimensional space (e.g. think of an interaction with a
low-prevalence binary variable, or the number of observations in the
region of the design space where, say, elevation and temperature are
both high -- and that's only 2 dimensions!)


Hoping some of this is useful or will spur some discussion.

best,

Kingsford Jones





#################################################################
# explore overfitting via automated model selection (here step) #
# 2010Feb14                                                     #
#################################################################

# Create a data frame with 'p' columns of 'n' random normal observations.
# Use "..." to pass arguments on to step (e.g. the AIC penalty parameter k).

falseSignal <- function(n, p, trace = 0, ...) {
  dat <- as.data.frame(replicate(p, rnorm(n)))
  # fit a full lm using all remaining columns to predict the first column
  full <- lm(V1 ~ ., dat)
  # reduce the model using the step procedure
  red <- step(full, trace = trace, ...)
  list(full.mod = summary(full), red.mod = summary(red))
}

# Example
# falseSignal(50, 15)
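# To see how routinely step() keeps pure noise, here is a small
# self-contained tally (same simulation as the function above, repeated
# 20 times; all predictors are noise by construction):

```r
set.seed(4)
n <- 50; p <- 15
kept <- replicate(20, {
  dat <- as.data.frame(replicate(p, rnorm(n)))
  red <- step(lm(V1 ~ ., dat), trace = 0)
  length(coef(red)) - 1   # noise predictors surviving selection
})
table(kept)   # typically several spurious 'signals' per run
```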




On Tue, Feb 9, 2010 at 5:33 AM, Gustaf Granath <Gustaf.Granath at ebc.uu.se> wrote:
> Standardized coefficients are not necessarily a bad idea.
>
> Gelman, A. 2008. Scaling regression inputs by dividing by two
> standard deviations. Statistics in Medicine 27:2865-2873.
>
> GG
>>
>> ...and you can also read in Frank Harrell's book why standardized
>> coefficients are a bad idea.  There is a large statistical literature
>> on variable importance in regression models.  For a discussion and
>> accompanying R package see
>>
>> @article{groemping2006relative,
>>  title={Relative importance for linear regression in R: the package
>> relaimpo},
>>  author={Gr{\"o}mping, U.},
>>  journal={Journal of Statistical Software},
>>  volume={17},
>>  number={1},
>>  pages={139--147},
>>  year={2006},
>>  publisher={American Statistical Association}
>> }
>>
>>
>> hth,
>>
>> Kingsford Jones
>>
>> 2010/2/8 Aitor Gastón <aitor.gaston at upm.es>:
>>
>>> Hi Nathan,
>>>
>>> Many authors criticize stepwise variable selection, e.g., Harrell,
>>> F.E., 2001, Regression Modeling Strategies with Applications to
>>> Linear Models, Logistic Regression and Survival Analysis.  You can
>>> find some of his arguments and extra references at
>>> http://childrens-mercy.org/stats/faq/faq12.asp
>>>
>>> Cheers,
>>>
>>> Aitor
>>>
>>> --------------------------------------------------
>>> From: "Nathan Lemoine" <lemoine.nathan at gmail.com>
>>> Sent: Saturday, February 06, 2010 5:17 PM
>>> To: <r-sig-ecology at r-project.org>
>>> Subject: [R-sig-eco] multiple regression
>>>
>>>> Hi everyone,
>>>>
>>>> I'm trying to fit a multiple regression model and have run into some
>>>> questions regarding the appropriate procedure to use. I am trying to
>>>> compare fish assemblages (species richness, total abundance, etc.) to
>>>> metrics of habitat quality. I swam transects and recorded all fish
>>>> observed, then I measured the structural complexity and live coral
>>>> cover over each transect. I am interested in weighing which of these
>>>> two metrics has the largest influence on structuring fish assemblages.
>>>>
>>>> My strategy was to use a multiple linear regression. Since the data
>>>> were in two different measurement units, I scaled the variables to a
>>>> mean of 0 and std. dev. of 1. This should allow me to compare the
>>>> sizes of the beta coefficients to determine the relative (but not
>>>> absolute) importance of each habitat variable on the fish assemblage,
>>>> correct?
>>>>
>>>> My model was lm(Species Richness ~ Complexity + Coral Cover). I had
>>>> run a full model and found no evidence of interactions, so I ran it
>>>> without the interaction present.
>>>>
>>>> It turns out coral cover was not significant in any regression. I
>>>> have been told that the test I used was incorrect and that the
>>>> appropriate procedure is a stepwise regression, which would,
>>>> undoubtedly, provide me with Complexity as a significant variable and
>>>> remove Coral Cover. This seems to me to be the exact same
>>>> interpretation as the above model. So, since I'm very new to all of
>>>> this, I am wondering how to tell whether one model is 'incorrect' or
>>>> 'inappropriate' given that they yield almost identical results? What
>>>> are the advantages of a stepwise regression over a standard multiple
>>>> regression like I have run?
>>>>
>>>> _______________________________________________
>>>> R-sig-ecology mailing list
>>>> R-sig-ecology at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
