[R] How to measure/rank variable importance when using rpart?

Terry Therneau therneau at mayo.edu
Mon Jan 24 15:53:42 CET 2011


--- included message ----
Thus, my question is: *What common measures exist for ranking/measuring
variable importance of participating variables in a CART model? And how
can this be computed using R (for example, when using the rpart package)?*

---end ----

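Consider the following excerpt (root node only) of the printout from rpart:
  library(rpart); library(survival)   # the lung data set comes with survival
  summary(rpart(time ~ age + ph.ecog + pat.karno, data=lung))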

Node number 1: 228 observations,    complexity param=0.03665178
  mean=305.2325, MSE=44176.93 
  left son=2 (81 obs) right son=3 (147 obs)
  Primary splits:
      pat.karno < 75   to the left,  improve=0.03661157, (3 missing)
      ph.ecog   < 1.5  to the right, improve=0.03620793, (1 missing)
      age       < 75.5 to the right, improve=0.01606491, (0 missing)
  Surrogate splits:
      ph.ecog < 1.5  to the right, agree=0.787, adj=0.392, (3 split)
      age     < 72.5 to the right, agree=0.680, adj=0.089, (0 split)

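In Breiman, Friedman, Olshen, & Stone, the canonical CART book, the
variable used for the primary split is credited with the split's
improvement, and each surrogate variable is credited with that
improvement times its adjusted agreement (adj).  For this split:
  pat.karno would get .0366 "points"
  ph.ecog   would get .0366 * .392 points
  age       would get .0366 * .089 points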
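
As a quick check of the arithmetic for this one split, here is a small R
snippet.  The numbers are copied by hand from the node 1 printout above;
the object names (improve, adj, node1_points) are only illustrative.

```r
# Values copied from the "Node number 1" printout above.
improve <- 0.03661157                     # improvement of the primary split (pat.karno)
adj     <- c(ph.ecog = 0.392, age = 0.089)  # adjusted agreement of each surrogate

# The primary variable gets the full improvement;
# each surrogate gets improvement * adj.
node1_points <- c(pat.karno = improve, improve * adj)
print(signif(node1_points, 3))
```

So at this node ph.ecog earns a bit over a third of what pat.karno
earns, and age earns less than a tenth.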

The reason for adding in surrogates is to account for redundant
variables.  Suppose for instance that x1=height but so is x10, just
measured on a different day.  They won't be exactly the same, so one
will get picked over the other at any given split; but at the end they
should get the same importance score.

This calculation is added up over all the splits to get a variable
importance.  So -- all the necessary ingredients are present.  Someone
just needs to write the importance function :-) 
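
A minimal sketch of what such a function might look like follows.  It is
not an official rpart function: importance_sketch and its scale_by_dev
argument are names made up here.  It assumes the usual layout of a fitted
rpart object, namely that fit$frame has the columns var, ncompete,
nsurrogate and dev, and that fit$splits lists, for each non-leaf node in
frame order, the primary split, then the competitor splits, then the
surrogates, with columns "improve" and "adj".

```r
library(rpart)

# Hypothetical helper, a sketch only: sums "improve" for primary splits and
# improve * adj for surrogates, over every split in the tree.
importance_sketch <- function(fit, scale_by_dev = FALSE) {
  frame  <- fit$frame
  splits <- fit$splits
  if (is.null(splits)) return(numeric(0))      # root-only tree: nothing to score

  is_split <- frame$var != "<leaf>"            # rows of frame that are internal nodes
  vars  <- as.character(frame$var[is_split])   # primary split variable per node
  ncomp <- frame$ncompete[is_split]            # competitor splits listed per node
  nsurr <- frame$nsurrogate[is_split]          # surrogate splits listed per node
  dev   <- frame$dev[is_split]

  # Each internal node owns 1 + ncompete + nsurrogate consecutive rows of
  # splits; 'first' is the row holding that node's primary split.
  block <- 1 + ncomp + nsurr
  first <- cumsum(c(1, block))[seq_along(block)]

  nm  <- character(0)
  val <- numeric(0)
  for (i in seq_along(first)) {
    imp <- splits[first[i], "improve"]
    if (scale_by_dev) imp <- imp * dev[i]      # optional: weight by node deviance
    nm  <- c(nm, vars[i])
    val <- c(val, imp)
    if (nsurr[i] > 0) {
      rows <- first[i] + ncomp[i] + seq_len(nsurr[i])   # this node's surrogate rows
      nm   <- c(nm, rownames(splits)[rows])
      val  <- c(val, imp * splits[rows, "adj"])
    }
  }
  # Add up the points per variable over all splits, largest first.
  sort(c(tapply(val, nm, sum)), decreasing = TRUE)
}

fit <- rpart(time ~ age + ph.ecog + pat.karno, data = survival::lung)
print(importance_sketch(fit))
```

One caveat: for the anova method the printed improve appears to be a
fraction of the node's own sum of squares, so adding the raw values gives
a split deep in the tree the same weight as one at the root.  Setting
scale_by_dev = TRUE multiplies each improvement by the node deviance
first, which may be the more sensible total; treat that choice as an
assumption of this sketch rather than part of the recipe above.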

Terry T.


