[R] rpart
"Jens Röder"
henrigel at gmx.de
Mon Nov 20 11:50:24 CET 2006
Dear r-help-list:
I have a question about how the "improve" value of a split is computed. The following is an excerpt from the summary output of a tree:

Node number 1: 600 observations, complexity param=0.007272727
  predicted class=0 expected loss=0.1666667
    class counts: 500 100
    probabilities: 0.833 0.167
  left son=2 (211 obs) right son=3 (389 obs)
  Primary splits:
      x4 < 0.5 to the left, improve=1.2284910, (0 missing)
      x1 < 1.5 to the left, improve=0.9729730, (0 missing)
      x10 < 1.5 to the right, improve=0.8371014, (0 missing)

Node number 2: 211 observations, complexity param=0.006666667
  predicted class=0 expected loss=0.1232227
    class counts: 185 26
    probabilities: 0.877 0.123
  left son=4 (123 obs) right son=5 (88 obs)
  Primary splits:
      x6 < 0.5 to the right, improve=1.0366150, (0 missing)
      x1 < 1.5 to the left, improve=0.7918369, (0 missing)
      x11 < 0.5 to the right, improve=0.5032110, (0 missing)

Node number 3: 389 observations, complexity param=0.007272727
  predicted class=0 expected loss=0.1902314
    class counts: 315 74
    probabilities: 0.810 0.190
  left son=6 (209 obs) right son=7 (180 obs)
  Primary splits:
      x7 < 0.5 to the right, improve=1.2448010, (0 missing)
      x10 < 1.5 to the right, improve=1.2076890, (0 missing)
      x9 < 1.5 to the right, improve=0.8054428, (0 missing)

I used the default values for the "parms" parameter. So the loss matrix has zeros on the diagonal and ones elsewhere, the priors are estimated from the class frequencies as (5/6, 1/6), and the splitting criterion is "gini".
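
For reference, the defaults described above can be written out explicitly. This is only a sketch of what I assume rpart uses when "parms" is omitted; the object name parms_default is made up for illustration and is not part of rpart.

```r
# Class counts at the root node, taken from the summary output above
counts <- c(500, 100)

# Assumed defaults: priors proportional to the class counts,
# 0/1 loss (zeros on the diagonal), and the Gini splitting criterion
parms_default <- list(
  prior = counts / sum(counts),
  loss  = matrix(c(0, 1, 1, 0), nrow = 2),
  split = "gini"
)

print(parms_default$prior)
print(parms_default$loss)
```
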
Why is the improve of the first split 1.228?
My calculation:
Impurity measure at the root node: 1/6*5/6=5/36
Node 2: 185/211*26/211, weight: 211/600
Node 3: 315/389*74/389, weight: 389/600
-> improve=5/36 - 211/600 * 185/211*26/211 - 389/600 * 315/389*74/389 = 0.001023743
Is there any normalisation?
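
The hand calculation above can be reproduced in R as follows. This is a minimal sketch of my own arithmetic, not of what rpart does internally; the helper impurity() and the variable names are made up for illustration, and the impurity is taken as p*(1-p) exactly as in my calculation.

```r
# Two-class impurity as used in my hand calculation: p * (1 - p)
impurity <- function(counts) {
  p <- counts / sum(counts)
  prod(p)
}

root  <- c(500, 100)  # class counts at node 1
left  <- c(185, 26)   # class counts at node 2
right <- c(315, 74)   # class counts at node 3
n <- sum(root)

# Root impurity minus the weighted impurities of the two child nodes
improve_by_hand <- impurity(root) -
  sum(left)  / n * impurity(left) -
  sum(right) / n * impurity(right)

print(improve_by_hand)

# Ratio between the value reported by summary() and the hand calculation
print(1.2284910 / improve_by_hand)
```

The last line prints the factor by which the two values differ, which may help to identify any scaling that is applied.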
If I use matrix(c(0,3,3,0),nrow=2) as the loss matrix, I get the same values as above. Shouldn't I simply get three times the improve of the case above?
Or is some normalisation applied here as well?
If I use matrix(c(0,1,5,0),nrow=2) as the loss matrix, I get different values. Shouldn't I simply get the same improve as in the case "matrix(c(0,3,3,0),nrow=2)", because of the symmetrization of the loss matrix in the case of two classes and the use of the Gini criterion?
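
To make the comparison of the two loss matrices concrete, here is a small sketch. It assumes that "symmetrization" means averaging the matrix with its transpose, which is my reading and may not be what rpart does; the variable names, and y and d in the commented call, are placeholders for illustration.

```r
loss_sym  <- matrix(c(0, 3, 3, 0), nrow = 2)
loss_asym <- matrix(c(0, 1, 5, 0), nrow = 2)  # matrix() fills column by column

# Assumed symmetrization: average of the matrix and its transpose
loss_asym_symmetrized <- (loss_asym + t(loss_asym)) / 2

print(loss_asym)
print(loss_asym_symmetrized)
print(isTRUE(all.equal(loss_asym_symmetrized, loss_sym)))

# Hypothetical call; y and d stand for my response variable and data frame:
# fit <- rpart(y ~ ., data = d, method = "class",
#              parms = list(loss = loss_asym))
```
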
Thank you very much for your help!
Henri