[R] Are minbucket and minsplit rpart options working as expected?

Carlos J. Gil Bellosta cgb at datanalytics.com
Wed Dec 7 20:10:51 CET 2005


Dear r-list:

I am using rpart to build a tree on a dataset. First I obtain a tree that is
perhaps too large:

> arbol.bsvg.02 <- rpart(formula, data = bsvg, subset=grp.entr,
+                        control=rpart.control(cp=0.001))
> arbol.bsvg.02
n= 100000

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 100000 6657 0 (0.93343000 0.06657000)
    2) meses_antiguedad_svg>=10.5 73899 3658 0 (0.95050001 0.04949999)
      4) eor_n1_gns< 1.5 63968 2807 0 (0.95611868 0.04388132)
        8) tarifa_gas=31,32,33,34 63842 2771 0 (0.95659597 0.04340403) *
        9) tarifa_gas=NO 126   36 0 (0.71428571 0.28571429)
         18) tipo_mercado=ESP,N/A 90   10 0 (0.88888889 0.11111111) *
         19) tipo_mercado=NE ,SAH,SAV 36   10 1 (0.27777778 0.72222222) *
      5) eor_n1_gns>=1.5 9931  851 0 (0.91430873 0.08569127)
       10) sn_calef>=0.5 8390  546 0 (0.93492253 0.06507747) *
       11) sn_calef< 0.5 1541  305 0 (0.80207657 0.19792343)
         22) tarifa_gas=31,NO 1134  141 0 (0.87566138 0.12433862) *
         23) tarifa_gas=32 407  164 0 (0.59705160 0.40294840)
           46) cons_gas_delta_1< 6997 196   51 0 (0.73979592 0.26020408) *
           47) cons_gas_delta_1>=6997 211   98 1 (0.46445498 0.53554502)
             94) meses_antiguedad_svg>=23.5 134   54 0 (0.59701493 0.40298507)
              188) altitud< 312 61   16 0 (0.73770492 0.26229508) *
              189) altitud>=312 73   35 1 (0.47945205 0.52054795)
                378) back_office>=1.5 39   12 0 (0.69230769 0.30769231) *
                379) back_office< 1.5 34    8 1 (0.23529412 0.76470588) *
             95) meses_antiguedad_svg< 23.5 77   18 1 (0.23376623 0.76623377) *
    3) meses_antiguedad_svg< 10.5 26101 2999 0 (0.88510019 0.11489981)
      6) sn_calef>=0.5 20129 1853 0 (0.90794376 0.09205624) *
      7) sn_calef< 0.5 5972 1146 0 (0.80810449 0.19189551)
       14) tarifa_gas=31 4406  664 0 (0.84929641 0.15070359) *
       15) tarifa_gas=32,NO 1566  482 0 (0.69220945 0.30779055)
         30) eor_n1_gns< 0.5 1168  306 0 (0.73801370 0.26198630) *
         31) eor_n1_gns>=0.5 398  176 0 (0.55778894 0.44221106)
           62) back_office>=0.5 148   35 0 (0.76351351 0.23648649) *
           63) back_office< 0.5 250  109 1 (0.43600000 0.56400000) *

So I decided not to consider branches with fewer than 1000 observations, that
is, 1% of the original number of observations. Therefore, following the
rpart.control help page, I set minbucket=1000. However,

> arbol.bsvg.02
n= 100000

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100000 6657 0 (0.9334300 0.0665700) *

And I get an "empty" tree, even though the original tree had branches with
more than 1000 observations. Something similar happens if I set minsplit (or
both minbucket and minsplit) to a similar value: I end up with the same
root-only, branchless tree.
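
One way to look into a root-only result like this is to inspect the fitted object's cptable and frame components. The sketch below is only illustrative: the bsvg data are not available here, so it assumes the kyphosis data set that ships with rpart as a stand-in, and the minbucket and minsplit values are arbitrary choices for that small data set.

```r
library(rpart)

# kyphosis (bundled with rpart) is a stand-in for the poster's bsvg data.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(cp = 0.001, minbucket = 10,
                                     minsplit = 20))

# One row per candidate subtree: CP, nsplit, rel error, plus the
# cross-validation columns. A root-only tree has a single nsplit = 0 row.
print(fit$cptable)

# fit$frame has one row per node; leaves are marked with var == "<leaf>".
n_nodes <- nrow(fit$frame)
leaf_n  <- fit$frame$n[fit$frame$var == "<leaf>"]
print(n_nodes)
print(leaf_n)
```

If n_nodes is 1, no split was kept at all. Otherwise leaf_n shows whether every terminal node respects the requested minbucket.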

Am I misreading something? Can anybody shed some light on the correct usage of
minbucket (and/or minsplit)?
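
For reference, a minimal sketch of how both options can be passed through rpart.control. It assumes the kyphosis data set bundled with rpart in place of bsvg, and the values 10 and 30 are hypothetical, scaled to that small data set. According to the rpart.control help page, when only one of minsplit or minbucket is given the other is derived from it (minsplit = 3 * minbucket, or minbucket = minsplit / 3), and cp falls back to its default of 0.01 whenever it is not passed explicitly, so the sketch sets all three.

```r
library(rpart)

# Set all three options explicitly so that none falls back to a default.
ctrl <- rpart.control(minbucket = 10, minsplit = 30, cp = 0.001)

# kyphosis is a stand-in; the poster's formula and bsvg data are not shown.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = ctrl)

# The control values actually used are stored with the fitted object.
print(fit$control[c("minsplit", "minbucket", "cp")])

# Sizes of the terminal nodes; none should be smaller than minbucket.
leaf_sizes <- fit$frame$n[fit$frame$var == "<leaf>"]
print(leaf_sizes)
```

Printing fit$control is a quick check that the refit really used the intended values, for instance that cp was not silently reset to its default.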

Sincerely,

Carlos J. Gil Bellosta
http://www.datanalytics.com



