[R] How to measure/rank "variable importance" when using rpart?
Terry Therneau
therneau at mayo.edu
Mon Jan 24 15:53:42 CET 2011
--- included message ----
Thus, my question is: *What common measures exist for ranking/measuring
variable importance of participating variables in a CART model? And how
can this be computed using R (for example, when using the rpart package)?*
---end ----
Consider the following printout from rpart:

  summary(rpart(time ~ age + ph.ecog + pat.karno, data=lung))

  Node number 1: 228 observations, complexity param=0.03665178
    mean=305.2325, MSE=44176.93
    left son=2 (81 obs) right son=3 (147 obs)
    Primary splits:
        pat.karno < 75   to the left,  improve=0.03661157, (3 missing)
        ph.ecog   < 1.5  to the right, improve=0.03620793, (1 missing)
        age       < 75.5 to the right, improve=0.01606491, (0 missing)
    Surrogate splits:
        ph.ecog   < 1.5  to the right, agree=0.787, adj=0.392, (3 split)
        age       < 72.5 to the right, agree=0.680, adj=0.089, (0 split)

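In Breiman, Friedman, Olshen, & Stone, the canonical CART book, each
variable earns "points" for this split: the primary variable gets the
improvement, and each surrogate gets the improvement times its adjusted
agreement (the adj value above):
    pat.karno would get .0366 points
    ph.ecog would get .0366 * .392 points
    age would get .0366 * .089 points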
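
As a quick check of the arithmetic for this one split, here is a small sketch in base R. The numbers are copied from the node 1 printout (improvement rounded to .0366, as in the text); the variable names `improve`, `adj` and `pts` are just illustrative.

```r
# Values copied from the node 1 printout
improve <- 0.0366                            # improvement of the primary split
adj     <- c(ph.ecog = 0.392, age = 0.089)   # adj of each surrogate split

# The primary variable gets the full improvement;
# each surrogate gets improve * adj
pts <- c(pat.karno = improve, improve * adj)
print(pts)
```

The result is a named vector with one entry per variable that took part in the split, either as the primary or as a surrogate.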
The reason for adding in surrogates is to account for redundant
variables. Suppose for instance that x1=height but so is x10, just
measured on a different day. They won't be exactly the same, so one
will get picked over the other at any given split; but at the end they
should get the same importance score.
This calculation is added up over all the splits to get a variable
importance. So -- all the necessary ingredients are present. Someone
just needs to write the importance function :-)
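
A minimal sketch of what such a function might look like follows. It does not read an rpart object; it assumes the splits have already been collected into a plain data frame with hypothetical columns `node`, `variable`, `type`, `improve` and `adj`. The function name `importance_sketch` is also made up. Node 1 uses the numbers from the printout; node 2 is invented purely to show the summation over splits.

```r
# Sum "points" per variable over all splits.
# Assumed (hypothetical) input: one row per primary or surrogate split.
importance_sketch <- function(splits) {
  primary   <- splits[splits$type == "primary", ]
  surrogate <- splits[splits$type == "surrogate", ]

  # A surrogate is credited with the improvement of the primary split
  # in the same node, scaled by its adj value
  surrogate$improve <- primary$improve[match(surrogate$node, primary$node)]

  pts  <- c(primary$improve, surrogate$improve * surrogate$adj)
  vars <- c(primary$variable, surrogate$variable)

  # Add up the points for each variable and rank them
  totals <- vapply(split(pts, vars), sum, numeric(1))
  sort(totals, decreasing = TRUE)
}

splits <- data.frame(
  node     = c(1, 1, 1, 2, 2),
  variable = c("pat.karno", "ph.ecog", "age", "age", "pat.karno"),
  type     = c("primary", "surrogate", "surrogate", "primary", "surrogate"),
  improve  = c(0.0366, NA, NA, 0.02, NA),   # node 2 values are made up
  adj      = c(NA, 0.392, 0.089, NA, 0.5),  # node 2 values are made up
  stringsAsFactors = FALSE
)

imp <- importance_sketch(splits)
print(imp)
```

Competing primary splits that were not chosen (ph.ecog and age under "Primary splits" above) earn nothing in this scheme; only the chosen primary and its surrogates are counted.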
Terry T.