[R-pkgs] Rule-based regression models: Cubist

Wed Apr 27 21:37:39 CEST 2011

Cubist is a rule-based machine learning model for regression. Parts of the
Cubist model are described in:

   Quinlan. Learning with continuous classes. Proceedings
   of the 5th Australian Joint Conference On Artificial
   Intelligence (1992) pp. 343-348

   Quinlan. Combining instance-based and model-based
   learning. Proceedings of the Tenth International Conference
   on Machine Learning (1993) pp. 236-243

RuleQuest, the company that created the program, now have a version
available under the GPL at:

   http://rulequest.com/cubist-info.html

We've taken the Cubist GPL code and created an R interface. The package
locations are:

   http://cran.r-project.org/web/packages/Cubist/index.html

and

   https://r-forge.r-project.org/projects/rulebasedmodels/

The primary functions are cubist() for creating the rules and the terminal
models and predict.cubist() for predicting new outcomes. The model allows for
instance-based corrections of the model predictions. We've separated the
instance-based correction from the model build so that the choice of
instances is only needed when samples are predicted. An interface for tuning
the Cubist model will be available in the caret package shortly.
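
As a rough sketch of that separation (assuming the Cubist and mlbench
packages are installed; the choice of 5 neighbors and of the first five rows
is arbitrary and only for illustration), the model can be fit once and then
predicted with and without the instance-based correction:

```r
library(Cubist)
library(mlbench)
data(BostonHousing)

## Fit the rules and terminal models once; no neighbors are chosen here.
mod <- cubist(x = BostonHousing[, -14], y = BostonHousing$medv)

## Arbitrary illustration rows (training rows reused for simplicity).
newdata <- BostonHousing[1:5, -14]

## Model-only predictions.
p0 <- predict(mod, newdata)

## Same fitted model, now adjusted using the 5 nearest training instances.
p5 <- predict(mod, newdata, neighbors = 5)

cbind(model_only = p0, with_neighbors = p5)
```

Because the neighbors are supplied only at prediction time, different values
can be compared without refitting the model.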

We are also working on a similar port of C5.0 (also GPL'ed). The C code is
very similar, so many of the Cubist changes can be carried over. That said,
we'd appreciate help if anyone wants to contribute.

Here is an example cubist session:

library(Cubist)
library(mlbench)
data(BostonHousing)

## 1 committee and no instance-based correction, so just an M5 fit:
mod1 <- cubist(x = BostonHousing[, -14], y = BostonHousing$medv)
summary(mod1)

## example output:

## Cubist [Release 2.07 GPL Edition]  Sun Apr 10 17:36:56 2011
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 506 cases (14 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [101 cases, mean 13.84, range 5 to 27.5, est err 1.98]
## 
##     if
##     nox > 0.668
##     then
##     outcome = -1.11 + 2.93 dis + 21.4 nox - 0.33 lstat + 0.008 b
##               - 0.13 ptratio - 0.02 crim - 0.003 age + 0.1 rm
## 
##   Rule 2: [203 cases, mean 19.42, range 7 to 31, est err 2.10]
## 
##     if
##     nox <= 0.668
##     lstat > 9.59
##     then
##     outcome = 23.57 + 3.1 rm - 0.81 dis - 0.71 ptratio - 0.048 age
##               - 0.15 lstat + 0.01 b - 0.0041 tax - 5.2 nox + 0.05 crim
##               + 0.02 rad
## 
##   Rule 3: [43 cases, mean 24.00, range 11.9 to 50, est err 2.56]
## 
##     if
##     rm <= 6.226
##     lstat <= 9.59
##     then
##     outcome = 1.18 + 3.83 crim + 4.3 rm - 0.06 age - 0.11 lstat
##               - 0.003 tax - 0.09 dis - 0.08 ptratio
## 
##   Rule 4: [163 cases, mean 31.46, range 16.5 to 50, est err 2.78]
## 
##     if
##     rm > 6.226
##     lstat <= 9.59
##     then
##     outcome = -4.71 + 2.22 crim + 9.2 rm - 0.83 lstat - 0.0182 tax
##               - 0.72 ptratio - 0.71 dis - 0.04 age + 0.03 rad - 1.7 nox
##               + 0.008 zn
## 
## 
## Evaluation on training data (506 cases):
## 
##     Average  |error|               2.07
##     Relative |error|               0.31
##     Correlation coefficient        0.94
## 
## 
##     Attribute usage:
##       Conds  Model
## 
##        80%   100%    lstat
##        60%    92%    nox
##        40%   100%    rm
##              100%    crim
##              100%    age
##              100%    dis
##              100%    ptratio
##               80%    tax
##               72%    rad
##               60%    b
##               32%    zn
## 
## 
## Time: 0.0 secs
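
To make the rule listing easier to read, here is a small base R sketch that
applies the printed Rule 1 by hand. The function name rule1 and the input
case are made up for illustration and are not part of the package. Values
from predict() can differ from this, since the printed coefficients are
rounded and a case may be covered by more than one rule.

```r
## Coefficients copied from Rule 1 in the output above.
rule1 <- function(d) {
  ifelse(d$nox > 0.668,
         -1.11 + 2.93 * d$dis + 21.4 * d$nox - 0.33 * d$lstat +
           0.008 * d$b - 0.13 * d$ptratio - 0.02 * d$crim -
           0.003 * d$age + 0.1 * d$rm,
         NA_real_)  # NA when the rule's condition does not hold
}

## Hypothetical case, made up for illustration (not a row of BostonHousing).
case <- data.frame(nox = 0.7, dis = 1.5, lstat = 20, b = 350,
                   ptratio = 20, crim = 10, age = 95, rm = 6)
pred <- rule1(case)
pred
```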

Thanks,

Max, Steve and Chris