[R] rpart v. lda classification.
ripley@stats.ox.ac.uk
Wed Feb 12 09:10:08 CET 2003
On Tue, 11 Feb 2003, Rolf Turner wrote:
>
> I've been groping my way through a classification/discrimination
> problem, from a consulting client. There are 26 observations, with 4
> possible categories and 24 (!!!) potential predictor variables.
>
> I tried using lda() on the first 7 predictor variables and got 24 of
> the 26 observations correctly classified. (Training and testing both
> on the complete data set --- just to get started.)
>
> I then tried rpart() for comparison and was somewhat surprised when
> rpart() only managed to classify 14 of the 26 observations correctly.
> (I got the same classification using just the first 7 predictors as I
> did using all of the predictors.)
>
> I would have thought that rpart(), being unconstrained by a parametric
> model, would have a tendency to over-fit and therefore to appear to
> do better than lda() when the test data and training data are the
> same.
>
> Am I being silly, or is there something weird going on? I can
> give more detail on what I actually did, if anyone is interested.
The first. rpart is seriously constrained by having so few observations,
and its model is much more restricted than lda: axis-parallel splits only.
There is a similar example, with pictures, in MASS (on the Cushings data).
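
The effect can be reproduced roughly on the Cushings data that ships with the MASS package. What follows is only a minimal sketch, not the analysis from the book or from the original question: the log transform, the names cush, lTetra and lPreg, and the relaxed control settings are all illustrative assumptions. It relies on the documented rpart.control() defaults (minsplit = 20, minbucket = round(minsplit/3), cp = 0.01), under which a data set of about 20 observations can be split at most once, so the default tree cannot even predict all three classes.

```r
library(MASS)   # lda() and the Cushings data set
library(rpart)  # rpart() and rpart.control()

# Keep only the patients of known type (levels a, b, c); drop the unknowns "u".
cush <- subset(Cushings, Type != "u")
cush$Type <- factor(cush$Type)                # drop the unused level "u"
cush$lTetra <- log(cush$Tetrahydrocortisone)  # assumed transform, for illustration
cush$lPreg  <- log(cush$Pregnanetriol)

# Linear discriminant analysis: boundaries may be oblique lines in the plane.
fit_lda <- lda(Type ~ lTetra + lPreg, data = cush)

# Classification tree with default controls: axis-parallel splits only,
# and no node with fewer than 20 observations is split.
fit_tree <- rpart(Type ~ lTetra + lPreg, data = cush, method = "class")

# Same tree with the size constraints relaxed (hypothetical settings,
# chosen only to show how much the defaults restrict the fit).
fit_deep <- rpart(Type ~ lTetra + lPreg, data = cush, method = "class",
                  control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

# Resubstitution accuracy: training and testing on the same data.
acc_lda  <- mean(predict(fit_lda, cush)$class == cush$Type)
acc_tree <- mean(predict(fit_tree, cush, type = "class") == cush$Type)
acc_deep <- mean(predict(fit_deep, cush, type = "class") == cush$Type)

# Number of terminal nodes in each tree.
leaves_tree <- sum(fit_tree$frame$var == "<leaf>")
leaves_deep <- sum(fit_deep$frame$var == "<leaf>")

print(c(lda = acc_lda, tree_default = acc_tree, tree_relaxed = acc_deep))
print(c(default = leaves_tree, relaxed = leaves_deep))
```

Comparing tree_default with tree_relaxed shows how much of the apparent deficit comes from the stopping rules rather than from the tree model itself. The relaxed tree is, of course, exactly the over-fitting one would expect, so its resubstitution accuracy says nothing about performance on new data.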
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595