[R] Recursive partitioning with multicollinear variables
Frank E Harrell Jr
feh3k at spamcop.net
Mon Feb 9 12:53:47 CET 2004
On Mon, 9 Feb 2004 11:24:39 +0100
"Jean-Noel" <jean-noel.candau at avignon.inra.fr> wrote:
> Dear all,
> I would like to perform a regression tree analysis on a dataset with
> multicollinear variables (as climate variables often are). The questions
> that I am asking are:
> 1- Is there any particular statistical problem in using multicollinear
> variables in a regression tree?
> 2- Multicollinear variables should appear as alternate splits. Would it be
> more accurate to present these alternate splits in the results of the
> analysis or apply a variable selection or reduction procedure before the
> regression tree?
> Thank you in advance,
>
> Jean-Noel Candau
You would get a more accurate and stable result by first performing a data
reduction procedure that ignores the response variable. Combining
collinear variables into an index is often better than arbitrarily
choosing between them. Then use the indexes in a regression model rather
than a tree, unless you have tens of thousands of observations for
recursive partitioning or are using bagging of trees or a related
procedure to cancel out the instability of the tree-growing process
(which unfortunately will often result in an average of trees that is
more complex in appearance than a regression model).
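
As a rough illustration of combining collinear variables into an index
without looking at the response, here is a minimal sketch in plain
Python (standard library only). It is not a method prescribed in this
thread: it assumes the simplest possible index, the row-wise average of
the standardized variables, and it assumes the variables all point in
the same direction. The first principal component of the cluster is a
common alternative. The variable names and numbers are made up for
illustration.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize a list of numbers to mean 0 and sample SD 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def composite_index(columns):
    """Average the z-scores of several collinear predictors, row by row.

    Only the predictors are used; the response never enters, so the
    reduction step cannot be tuned to the outcome.
    """
    standardized = [zscores(col) for col in columns]
    return [mean(row) for row in zip(*standardized)]


# Hypothetical climate predictors (made-up values, illustration only).
temp_mean = [10.1, 12.3, 14.8, 9.5, 16.2, 13.0]
temp_max = [15.0, 17.9, 20.1, 14.2, 22.5, 18.3]
temp_min = [5.2, 6.9, 9.6, 4.7, 10.4, 7.8]

# One index replaces three highly correlated columns.
temperature_index = composite_index([temp_mean, temp_max, temp_min])
print([round(v, 2) for v in temperature_index])
```

The resulting temperature_index would then be used as a single
predictor in the regression model in place of the three original
columns.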
Frank
---
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University