[R] Classification problem - rpart

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Apr 10 20:16:30 CEST 2003


You have slope and tci10 (at least) coded as factors.  Run summary() on
your data frame.  My bet is that you have the string MISSING in those
columns, and did not declare it as a missing-value code when using
read.table (or whatever).

Do *look* at your data at least cursorily.
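
A minimal sketch of what such a look might turn up, assuming the file
uses the literal string MISSING as its missing-value code (a guess, not
something confirmed from the data). The four rows below are a
hypothetical miniature, not the poster's file:

```r
# Hypothetical miniature of the data; the string MISSING stands in for
# whatever missing-value code the real file uses (an assumption).
txt <- "curvegrid,slope,tci10,class
-0.000244141,2.382412,131,random
0.3005371,MISSING,63,random
-0.3000488,20.25561,MISSING,tree
-0.5,18.68057,61,tree"

# stringsAsFactors = TRUE mimics the default of R versions before 4.0.0
dat <- read.csv(text = txt, stringsAsFactors = TRUE)

summary(dat)                      # factor columns show level counts, not quartiles
is_fac <- sapply(dat, is.factor)  # which columns were read as factors?
print(is_fac)
levels(dat$slope)                 # the string MISSING appears among the levels
```

A single non-numeric entry is enough to make read.csv treat the whole
column as text, which is why slope and tci10 come back as factors while
curvegrid stays numeric.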

On Thu, 10 Apr 2003, Andy Bunn wrote:

> I am performing a binary classification using a classification tree.
> Ironically, the data themselves are 2483 tree (real biological ones)
> locations as described by a suite of environmental variables (slope, soil
> moisture, radiation load, etc). I want to separate them from an equal number
> of random points. Doing eda on the data shows that there is substantial
> difference between the tree and random classes, e.g., box and whisker plots
> for slope show separation.
> 
> The data frame is thus:
> 
> curvegrid,dir2tl,dist2tl,slope,tasp,tci10,class
> -0.000244141,266,1852.701,2.382412,0.2124468,131,random
> 0.3005371,246,1146.342,10.45694,0.8045813,63,random
> .
> .
> .
> .
> -0.3000488,90,10,20.25561,-0.1293357,62,tree
> -0.5,90,10,18.68057,-0.05228489,61,tree
> -0.6994629,0,0,18.30121,0.0320744,66,tree
> 
> I've run rpart on similar data without an issue but when I try it on this
> data as follows:
> 
> tree <- rpart(class ~ curvegrid + slope + tci10, method="class")
> 
> I get the following output:
> 
> > tree
> n= 4966 
> 
> node), split, n, loss, yval, (yprob)
>       * denotes terminal node
> 
> 1) root 4966 2483 dw (0.500000000 0.500000000)  
>   2) slope=0.3206026,0.5159777,0.679302,0.7163697,1.1324.......... 2574   94 dw (0.963480963 0.036519037) *
>   3) slope=0,0.1011371,0.1013844,0.2027681,0.2267014,0.32......... MISSING 2392    3 random (0.001254181 0.998745819) *
> 
> 
> This is not like other trees I have run!
> 
> And:
> 
> summary(tree)
> > summary(tree)
> Call:
> rpart(formula = class ~ curvegrid + slope + tci10)
>   n= 4966 
> 
>          CP nsplit  rel error    xerror       xstd
> 1 0.9609344      0 1.00000000 1.0322191 0.01418310
> 2 0.0100000      1 0.03906565 0.7635924 0.01378822
> 
> Node number 1: 4966 observations,    complexity param=0.9609344
>   predicted class=dw      expected loss=0.5
>     class counts:  2483  2483
>    probabilities: 0.500 0.500 
>   left son=2 (2574 obs) right son=3 (2392 obs)
>   Primary splits:
>      slope     splits as  RRRRRRLRRRLRRRRLLRRRRRRR.......
>      tci10     splits as  RRRRRRRRRRLLRLLRLLRLLRLL.......
> 
> etc.
> 
> Node number 2: 2574 observations
>   predicted class=dw      expected loss=0.03651904
>     class counts:  2480    94
>    probabilities: 0.963 0.037 
> 
> Node number 3: 2392 observations
>   predicted class=random  expected loss=0.001254181
>     class counts:     3  2389
>    probabilities: 0.001 0.999
> 
> I'm assuming that I have to adjust something in rpart.control. I am also
> hesitant about posting prematurely but am in fetters.

It's not rpart, it is your data manipulation: `pilot error'.
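
One possible repair, sketched under the assumption that MISSING is the
offending string: declare it via na.strings when reading, check that the
columns are numeric, then refit. The data here are simulated stand-ins
(hypothetical values, not the poster's), written to a temporary file
only so the read step can be shown:

```r
library(rpart)  # recommended package shipped with standard R installations

# Simulated stand-in for the real data (hypothetical values)
set.seed(1)
n <- 200
sim <- data.frame(
  slope = c(runif(n, 0, 10), runif(n, 8, 30)),
  tci10 = round(c(runif(n, 60, 140), runif(n, 55, 90))),
  class = rep(c("random", "tree"), each = n)
)
# Plant the assumed missing-value code in two rows, as text
sim$slope <- as.character(sim$slope)
sim$slope[c(5, 250)] <- "MISSING"

f <- tempfile(fileext = ".csv")
write.csv(sim, f, row.names = FALSE, quote = FALSE)

# Declare the code on input so slope is read as numeric with NAs
dat <- read.csv(f, na.strings = "MISSING", stringsAsFactors = TRUE)
stopifnot(is.numeric(dat$slope), sum(is.na(dat$slope)) == 2)

fit <- rpart(class ~ slope + tci10, data = dat, method = "class")
print(fit)  # numeric predictors are split on thresholds, not on sets of levels
```

If the data frame is already in memory, as.numeric(as.character(x)) on
each affected column should give the same result, with a warning about
NAs introduced by coercion.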

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


