[R] rpart

Mon Nov 20 11:50:24 CET 2006

Dear r-help-list:

I have a question about how the "improve" value of a split is computed. The following is an extract from the summary() output of a tree:

Node number 1: 600 observations,    complexity param=0.007272727
  predicted class=0  expected loss=0.1666667
    class counts:   500   100
   probabilities: 0.833 0.167 
  left son=2 (211 obs) right son=3 (389 obs)
  Primary splits:
      x4  < 0.5 to the left,  improve=1.2284910, (0 missing)
      x1  < 1.5 to the left,  improve=0.9729730, (0 missing)
      x10 < 1.5 to the right, improve=0.8371014, (0 missing)

Node number 2: 211 observations,    complexity param=0.006666667
  predicted class=0  expected loss=0.1232227
    class counts:   185    26
   probabilities: 0.877 0.123 
  left son=4 (123 obs) right son=5 (88 obs)
  Primary splits:
      x6  < 0.5 to the right, improve=1.0366150, (0 missing)
      x1  < 1.5 to the left,  improve=0.7918369, (0 missing)
      x11 < 0.5 to the right, improve=0.5032110, (0 missing)

Node number 3: 389 observations,    complexity param=0.007272727
  predicted class=0  expected loss=0.1902314
    class counts:   315    74
   probabilities: 0.810 0.190 
  left son=6 (209 obs) right son=7 (180 obs)
  Primary splits:
      x7  < 0.5 to the right, improve=1.2448010, (0 missing)
      x10 < 1.5 to the right, improve=1.2076890, (0 missing)
      x9  < 1.5 to the right, improve=0.8054428, (0 missing)

I used the default values for the "parms" argument, so the loss matrix is the 0-1 loss (zeros on the diagonal, ones elsewhere), the priors are estimated from the class frequencies as (5/6, 1/6), and the splitting criterion is "gini".
Why is the improve of the first split 1.228?
My calculation:
Impurity measure at the root node: 1/6*5/6=5/36
Node 2: 185/211*26/211, weight: 211/600
Node 3: 315/389*74/389, weight: 389/600
-> improve = 5/36 - (211/600)*(185/211)*(26/211) - (389/600)*(315/389)*(74/389) = 0.001023743
Is there any normalisation?
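
For reference, my hand calculation can be reproduced with a few lines of base R. This is only a sketch of the arithmetic I used (impurity taken as p*(1-p) for two classes, children weighted by their share of the observations). The variable and function names are made up for illustration, and nothing here is meant to describe how rpart computes "improve" internally.

```r
# Class counts copied from the summary() output above.
n_root  <- c(500, 100)   # node 1
n_left  <- c(185, 26)    # node 2
n_right <- c(315, 74)    # node 3

# Two-class impurity as used in my calculation: p * (1 - p).
# (Hypothetical helper, not an rpart function.)
impurity <- function(counts) {
  p <- counts / sum(counts)
  prod(p)
}

# Weights of the two children: 211/600 and 389/600.
w_left  <- sum(n_left)  / sum(n_root)
w_right <- sum(n_right) / sum(n_root)

# Decrease in impurity achieved by the split at the root.
improve_by_hand <- impurity(n_root) -
  w_left  * impurity(n_left) -
  w_right * impurity(n_right)

print(improve_by_hand)   # about 0.001023743, not the 1.228491 reported
```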

If I use matrix(c(0,3,3,0),nrow=2) as the loss matrix, I get the same values as above. Shouldn't I simply get three times the improve of the case above?
Or is some normalisation applied here as well?

If I use matrix(c(0,1,5,0),nrow=2) as the loss matrix, I get different values. Shouldn't I get the same improve as in the case "matrix(c(0,3,3,0),nrow=2)", because the loss matrix is symmetrized when there are two classes and the Gini criterion is used?
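
To make the symmetrization argument concrete, here is a small base R sketch. It assumes that "symmetrizing" means averaging the loss matrix with its transpose. That is my reading of the two-class Gini case and not something I have confirmed for rpart. The object names are only illustrative.

```r
# The two loss matrices from the text (filled column by column).
L_sym  <- matrix(c(0, 3, 3, 0), nrow = 2)
L_asym <- matrix(c(0, 1, 5, 0), nrow = 2)

# Assumed symmetrization: average the matrix with its transpose.
L_asym_sym <- (L_asym + t(L_asym)) / 2

print(L_asym_sym)

# Under this assumption both loss matrices coincide after symmetrization,
# which is why I expected identical improve values in the two cases.
print(isTRUE(all.equal(L_asym_sym, L_sym)))   # TRUE
```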

Thank you very much for your help!

Henri 