[R] Are minbucket and minsplit rpart options working as expected?
Carlos J. Gil Bellosta
cgb at datanalytics.com
Wed Dec 7 20:10:51 CET 2005
Dear r-list:
I am using rpart to build a tree on a dataset. First I obtain a perhaps too
large tree:
> arbol.bsvg.02 <- rpart(formula, data = bsvg, subset = grp.entr,
+                        control = rpart.control(cp = 0.001))
> arbol.bsvg.02
n= 100000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 100000 6657 0 (0.93343000 0.06657000)
  2) meses_antiguedad_svg>=10.5 73899 3658 0 (0.95050001 0.04949999)
    4) eor_n1_gns< 1.5 63968 2807 0 (0.95611868 0.04388132)
      8) tarifa_gas=31,32,33,34 63842 2771 0 (0.95659597 0.04340403) *
      9) tarifa_gas=NO 126 36 0 (0.71428571 0.28571429)
        18) tipo_mercado=ESP,N/A 90 10 0 (0.88888889 0.11111111) *
        19) tipo_mercado=NE ,SAH,SAV 36 10 1 (0.27777778 0.72222222) *
    5) eor_n1_gns>=1.5 9931 851 0 (0.91430873 0.08569127)
      10) sn_calef>=0.5 8390 546 0 (0.93492253 0.06507747) *
      11) sn_calef< 0.5 1541 305 0 (0.80207657 0.19792343)
        22) tarifa_gas=31,NO 1134 141 0 (0.87566138 0.12433862) *
        23) tarifa_gas=32 407 164 0 (0.59705160 0.40294840)
          46) cons_gas_delta_1< 6997 196 51 0 (0.73979592 0.26020408) *
          47) cons_gas_delta_1>=6997 211 98 1 (0.46445498 0.53554502)
            94) meses_antiguedad_svg>=23.5 134 54 0 (0.59701493 0.40298507)
              188) altitud< 312 61 16 0 (0.73770492 0.26229508) *
              189) altitud>=312 73 35 1 (0.47945205 0.52054795)
                378) back_office>=1.5 39 12 0 (0.69230769 0.30769231) *
                379) back_office< 1.5 34 8 1 (0.23529412 0.76470588) *
            95) meses_antiguedad_svg< 23.5 77 18 1 (0.23376623 0.76623377) *
  3) meses_antiguedad_svg< 10.5 26101 2999 0 (0.88510019 0.11489981)
    6) sn_calef>=0.5 20129 1853 0 (0.90794376 0.09205624) *
    7) sn_calef< 0.5 5972 1146 0 (0.80810449 0.19189551)
      14) tarifa_gas=31 4406 664 0 (0.84929641 0.15070359) *
      15) tarifa_gas=32,NO 1566 482 0 (0.69220945 0.30779055)
        30) eor_n1_gns< 0.5 1168 306 0 (0.73801370 0.26198630) *
        31) eor_n1_gns>=0.5 398 176 0 (0.55778894 0.44221106)
          62) back_office>=0.5 148 35 0 (0.76351351 0.23648649) *
          63) back_office< 0.5 250 109 1 (0.43600000 0.56400000) *
So I decided to disallow branches with fewer than 1000 observations, that is,
1% of the original number of observations. Following the rpart.control help
page, I set minbucket=1000. However, printing the refitted tree gives:
> arbol.bsvg.02
n= 100000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 100000 6657 0 (0.9334300 0.0665700) *
I get an "empty" tree, even though the original tree had branches with more
than 1000 observations. Something similar happens if I set minsplit (or both
minbucket and minsplit) to a similar value: I end up with the same root-only
tree, with no branches at all.
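For anyone who wants to reproduce the setup without my data, here is a minimal
sketch on the kyphosis data that ships with rpart. The dataset, the formula and
the value minbucket = 20 are stand-ins chosen for illustration, not what I ran
on bsvg. It fits one tree with the default bucket size and one with a larger
minbucket, then lists the terminal node sizes and the cp table of the second
fit.

```r
library(rpart)

# kyphosis (81 rows) ships with rpart; it stands in for the bsvg data.
ctrl_small <- rpart.control(cp = 0.001)
ctrl_big   <- rpart.control(cp = 0.001, minbucket = 20)  # 20 is illustrative

fit_small <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class", control = ctrl_small)
fit_big   <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class", control = ctrl_big)

# Terminal nodes are the rows of $frame whose var is "<leaf>".
leaf_sizes <- function(fit) fit$frame$n[fit$frame$var == "<leaf>"]

print(leaf_sizes(fit_small))
print(leaf_sizes(fit_big))   # no terminal node should be smaller than 20

# The cp table lists the splits that were kept and their relative error.
printcp(fit_big)
```

Note that when minbucket is given without minsplit, rpart.control sets minsplit
to three times minbucket, so the two options are not independent.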
Am I misreading something? Can anybody shed some light on the correct usage of
minbucket (and/or minsplit) for me?
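In case it helps whoever answers, here is some arithmetic on the numbers in the
first listing. It only restates what the printout shows: the root split leaves
the total loss unchanged, and every terminal node that predicts class 1 has
fewer than 1000 observations. I cannot say whether this is the reason rpart
returns only the root; the variable names below are mine.

```r
# Loss values copied from the first listing (nodes 1, 2 and 3).
loss_root     <- 6657
loss_children <- c(node2 = 3658, node3 = 2999)

# Reduction in misclassified cases achieved by the root split alone.
loss_gain <- loss_root - sum(loss_children)
print(loss_gain)  # 0

# Sizes of the terminal nodes that predict class 1 (nodes 19, 379, 95, 63).
n_leaves_class1 <- c(`19` = 36, `379` = 34, `95` = 77, `63` = 250)
print(max(n_leaves_class1) < 1000)  # TRUE
```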
Sincerely,
Carlos J. Gil Bellosta
http://www.datanalytics.com