The **Rborist** package implements the Random Forest
(TM) algorithm, with particular emphasis on high performance. The
package is an **R**-language spinoff of the
**Arborist** project, a multi-language effort targeting a
variety of decision-tree methods. Look and feel owes a large debt to
Liaw’s original **randomForest** package.

The interpretation of the phrase “high performance” will vary among
users. We claim that the **Rborist** is a high-performance
package primarily because it either does, or has the potential to, take
advantage of the acceleration offered by commodity parallel hardware.
We also expect performance to scale in accordance with algorithmic
complexity, decaying gracefully as resource limitations are
approached.

Particular attention has been paid to minimizing data movement and,
especially, toward maximizing *data locality*. We believe that
this has been a key contributing factor to performance and scalability
and will continue to play a major role in efforts to extend support more
broadly.

The current implementation is limited to in-memory execution on multicore and multi-socket hardware. We envision the following improvements in the near to medium term:

- Separate training of tree blocks over multiple compute nodes.

- Training of significant portions of individual trees on vector coprocessors, such as GPUs.

- Pipelined training over out-of-memory workloads.

The simplest way to invoke the package is to pass it a design matrix
and response vector, of conforming dimensions. For appropriately typed
design *x* and response *y*, then, it suffices to
call:

`rs <- Rborist(x, y)`

The design can be either a *data.frame*, a numeric matrix or
an integer matrix. In the case of a frame, the columns (predictors)
must, individually, have either numeric or factor value. Integer
matrices are coerced internally to their numeric counterparts. The
response type may be either numeric, yielding a regression forest, or
categorical, yielding a classification forest.
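
For example, the following toy sketch (with fabricated data) trains one forest of each flavor from a single mixed-type frame:

```
df <- data.frame(num = rnorm(100),
                 fac = factor(sample(c("a", "b"), 100, replace = TRUE)))
yNum <- rnorm(100)                                        # numeric response: regression
yFac <- factor(sample(c("u", "v"), 100, replace = TRUE))  # factor response: classification
rsReg <- Rborist(df, yNum)
rsCls <- Rborist(df, yFac)
```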

The return value (here *rs*) is of class *Rborist*. The
full layout of an *Rborist* object is described by the
**help()** files. A very useful example is the
*validation* object, which summarizes testing on the out-of-bag
samples.

In regression training, the *validation* object’s members
include mean absolute and square errors, as well as the r-squared
statistic. Continuing from the previous codelet, these are obtainable as
follows:

```
rs$validation$mae
rs$validation$mse
rs$validation$rsq
```

These statistics are derived from the original training response
(*y*, above) and the derived out-of-bag response. The out-of-bag
response itself can also be obtained from the *validation*
object:

`rs$validation$yPred`
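
Assuming these statistics follow their standard definitions, they can be recomputed directly from the training response and the out-of-bag response, as in the following sketch:

```
mean(abs(y - rs$validation$yPred))                           # mae
mean((y - rs$validation$yPred)^2)                            # mse
1 - sum((y - rs$validation$yPred)^2) / sum((y - mean(y))^2)  # rsq
```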

*validation* also contains member *qPred*, for use with
quantile regression. This member will be described in the next
section.

In classification training, the *validation* object also
presents the out-of-bag response in member *yPred*. Its other
members, however, are specialized for classification:

- The misprediction rate is reported for each classification category by field *misprediction*.

- A confusion matrix is reported by field *confusion*.

- The out-of-bag error rate is given by *oobError*.

- The *census* field is a matrix giving, for each row, the number of times each response category is predicted for the row.

- The *prob* field reports a normalized version of *census* and can be interpreted as the probability of predicting a given category at a given row.
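
For a hypothetical factor response *yFac*, these members are accessed in the same manner as their regression counterparts:

```
rs <- Rborist(x, yFac)         # 'yFac' is a hypothetical factor response
rs$validation$misprediction    # per-category misprediction rates
rs$validation$confusion        # confusion matrix
rs$validation$oobError         # overall out-of-bag error rate
rs$validation$census           # per-row counts of predicted categories
rs$validation$prob             # normalized census: per-row prediction probabilities
```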

In addition to *validation*, an *Rborist* object
contains several other members. Most of these are used to coordinate
subsequent operations, such as prediction or feature contribution, and
do not directly convey summary information. An exception is the
*training* member, which summarizes training statistics not
directly related to validation. *training* currently has a single
member, *info*:

`rs$training$info`

For each predictor the *info* vector reports the sum, over all
tree nodes in the forest for which the predictor defines a splitting
condition, of the information value precipitating the respective split.
Although these values depend on the information metric employed, they do
provide a relative measure of each predictor’s importance in the
model.
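
Since only relative magnitudes are meaningful, one simple use is to rank the predictors, assuming *info* carries one entry per predictor as described above:

```
# Rank predictors by accumulated information contribution:
sort(rs$training$info, decreasing = TRUE)
```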

Validation can be suppressed as follows:

`rs <- Rborist(x, y, noValidate = TRUE)`

This option is primarily for the use of package maintainers, but may
prove useful, for example, in instances when model fit is not a main
concern. The *validation* field will then have value NULL.

When predicting over a regression forest, Rborist provides quantiles
simply for the asking. Leaf information, by default, is rich enough to
make their computation quite easy. Quantiles can be requested in either
of two ways. The simplest is to set option *quantiles* to
*TRUE*:

`rs <- Rborist(x, y, quantiles = TRUE)`

This default quantile vector consists of quartiles, which are given
by the *qPred* member mentioned above:

`rs$validation$qPred`

Quantile prediction also estimates the quantile position at which the
predicted mean, *yPred*, lies. This estimate is reported by the
*qEst* value, as follows:

`rs$validation$qEst`

Explicitly specifying the quantile vector, *quantVec*, yields
quantiles of any desired type. Deciles, for example, can be requested as
follows:

`rs <- Rborist(x, y, quantVec = seq(0.1, 1.0, by = 0.1))`

The algorithm employed to compute the quantiles is approximate, as binning is employed to improve performance.

If, on the other hand, quantiles are not desired when training a
regression forest, and will not subsequently be requested *after*
training, space can be saved by representing leaves in a sparser
form:

`rs <- Rborist(x, y, thinLeaves = TRUE)`

The simplest way to affect forest size is to specify the number of trees. The following codelet requests 100 trees:

`rs <- Rborist(x, y, nTree = 100)`

This also affects training time, which is expected to scale linearly with the number of trees.

Training of individual trees can be constrained according to several
parameters. The *minNode* option places a lower bound on node
sizes, i.e., the number of distinct samples subsumed by each node. A
value of 1, for example, allows splitting to proceed until purity. A
larger value results in a smaller tree, as well as faster training. To
ensure that splitting does not proceed below a node size of 20, for
example:

`rs <- Rborist(x, y, minNode = 20)`

Another way to control tree size is to specify its maximal depth. The
option *nLevel* sets an upper bound on the number of levels over
which trees can be trained. The following codelet causes tree
construction to halt after the root is created, resulting in a forest of
singleton trees.

`rs <- Rborist(x, y, nLevel = 1)`

As with several other size-based constraints, constraining level depth also results in faster execution.

Option *maxLeaf* limits the number of leaves (terminals)
present in a tree. Unlike other size constraints described so far,
*maxLeaf* is enforced *after* training, so as not to bias
tree shape. Although this pruning operation is quite fast, limiting
the number of leaves therefore will not speed the training process.

`rs <- Rborist(x, y, maxLeaf = 13)`

Option *minInfo* sets a splitting threshold based on relative
information gain. A splitting candidate is rejected if the ratio of
information content between a node and its potential successors lies
below the threshold.

`rs <- Rborist(x, y, minInfo = 0.1)`

This option should be applied with care and should probably be avoided at low predictor count.

Performance as well as storage economy can in some cases both be
enhanced by abstracting away repeated instances of the same predictor
value. This is the idea behind sparse representations and, in
particular, one-hot encodings, in which repeated values of zero are
represented implicitly. The Arborist employs a similar strategy, but on
a *per-predictor* basis, representing high-frequency observations
of a given predictor implicitly. A plurality threshold above which to
compress repeated values is specified by the *autoCompress*
option:

`rs <- Rborist(x, y, autoCompress = 0.4)`

As *autoCompress* specifies a plurality threshold, only a
single set of repeated values undergoes compression for a given
predictor. Resolution of ties, in particular, is
implementation-dependent. A threshold frequency of 1.0, the maximum,
ensures that no compression takes place, while a threshold frequency of
0.0, the minimum, ensures some value is compressed for each predictor,
regardless of frequency.
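
The two extremes can thus be written explicitly; the threshold values below simply restate the behavior described above:

```
rs <- Rborist(x, y, autoCompress = 1.0)  # maximum threshold: no compression
rs <- Rborist(x, y, autoCompress = 0.0)  # minimum threshold: one value compressed per predictor
```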

As a complement to the *thinLeaves* option for training, the
*Streamline* command can be applied following training to reduce
a forest’s memory footprint. *Streamline* clears fields employed
by validation, quantile regression and feature contribution, so should
not be employed if any of these operations are expected to be performed
subsequently.

```
rs <- Rborist(x, y)
...
rs <- Streamline(rs)
```

Several options affect the behavior of the various sampling mechanisms used by the package.

Option *nSamp* dictates the number of bagged samples defining
the root of each tree. A smaller sample count may result in smaller
trees, as fewer opportunities arise to split.

Option *withRepl* specifies whether bag sampling is to be done
with replacement.
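
For example, the following sketch (with an illustrative sample count) draws 500 samples per tree, without replacement:

`rs <- Rborist(x, y, nSamp = 500, withRepl = FALSE)`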

Vector *rowWeight* weights the probability of bagging a given
row. The following invocation gives each row identical weight, which is
also the default behavior:

`rs <- Rborist(x, y, rowWeight = rep(1/nrow(x), nrow(x)))`

Vector *predWeight* weights the probability of selecting a
given predictor as splitting candidate. For example, this invocation
selects the first predictor half as often as each of the remaining
predictors:

`rs <- Rborist(x, y, predWeight = c(0.5, rep(1.0, ncol(x) - 1)))`

Option *predProb* is the uniform probability of selecting a
predictor for splitting. That is, the value applies uniformly to all
predictors. Hence the number of predictors tried at a given split will
have a binomial distribution.
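
For example, to try each predictor at a given split with probability 0.3 (the value is illustrative):

`rs <- Rborist(x, y, predProb = 0.3)`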

Option *predFixed* is the actual number of predictors to test
for a given split, and so calls for sampling predictors without
replacement. *predFixed* and *predProb* cannot both be
specified within a single training operation.
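
The following codelet, for example, tries a fixed number of predictors per split, here the customary square root of the predictor count (the choice is illustrative):

`rs <- Rborist(x, y, predFixed = floor(sqrt(ncol(x))))`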

For regression training, vector *regMono* specifies a (signed)
rejection probability to enforce monotonicity with respect to a given
predictor. Negative values specify rejection of nondecreasing splitting
candidates with a given probability, while positive values stipulate a
rejection probability for nonincreasing predictors. A value of zero
indicates that no monotonicity constraint is enforced. Values assigned
to factor predictors are ignored.

The following example rejects nonincreasing predictors as splitting candidates with probability one:

`rs <- Rborist(x, y, regMono = rep(1.0, ncol(x)))`
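
Constraints may also differ by predictor. The following sketch rejects nondecreasing candidates for the first predictor with probability one and leaves the remaining predictors unconstrained:

`rs <- Rborist(x, y, regMono = c(-1.0, rep(0.0, ncol(x) - 1)))`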

Classification training can be fine-tuned by weighting individual
categories. The option *classWeight* permits weights to be
specified, by category, for the objective function used to split. This
may be useful, for example, when the training response is
unbalanced.

The following example employs a non-normalized weighting vector to give the first category twice as much weight as the others. Note that the category encoding reflected by `levels()` is not portable:

```
lv <- levels(y)
rs <- Rborist(x, y, classWeight = c(2.0, rep(1.0, length(lv) - 1)))
```

By default, when a numerical predictor is chosen to split a node, the
Arborist assigns its split value as that corresponding to the mean rank,
with respect to the *full* set of observations on the predictor,
between the two split boundary points. That is, the splitting criterion
attempts to reflect the *ECDF* of the entire sample. This
contrasts with other implementations, which typically select either the
midpoint value or one of the two boundary values. In particular,
depending upon how the observations are distributed, the midrank can
correspond to a value arbitrarily close to either of the two
boundaries.

The vector *splitQuant* allows a (sub)quantile value to be
interpolated, so that the split value can be manipulated more finely
with respect to the two endpoints. For example, the following codelet
expresses the default behavior, which is to select the middle rank
(i.e., 0.5 quantile) for all numerical predictors (if any):

`rs <- Rborist(x, y, splitQuant = rep(0.5, ncol(x)))`

Similarly, this example chooses the left endpoint for all relevant predictors:

`rs <- Rborist(x, y, splitQuant = rep(0.0, ncol(x)))`

The Arborist represents predictors internally using a format streamlined for subsequent training. A “preformatted” representation of the training data can be generated directly and trained separately as follows:

```
pf <- Preformat(x)
rs <- Rborist(pf, y)
```

Separate preformatting can result in a slight performance improvement
in workflows with iterated training, such as under *Caret*. This
is simply because the sorting and packing performed at the start of
training can be cached instead of repeatedly computed.
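
A sketch of such a workflow preformats once, then trains repeatedly (the tree counts are illustrative):

```
pf <- Preformat(x)
rsSmall <- Rborist(pf, y, nTree = 100)
rsLarge <- Rborist(pf, y, nTree = 1000)
```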

A better motivation for preformatting arises when the training data
can be represented sparsely. Suppose, for example, that *B* is a
large data set consisting chiefly of predictors with highly repetitive
observation values. As the Arborist is able to identify and compress
repetitive observations, storage might be conserved by first
preformatting *B*, then deleting it and finally training on the
preformatted representation:

```
pf <- Preformat(B)
rm(B)
rs <- Rborist(pf, y)
```

Unlike many implementations, which employ sorting to order
observations within a node, the Arborist employs a presort followed by
linear updates. The updates are a form of stable partition, which we
refer to as *restaging*. Restaging reduces the algorithmic
complexity from the oft-cited \(\mathcal{O}(n \log^2 n)\), where *n*
represents the number of training samples, to \(\mathcal{O}(n \log n)\).
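
The stable-partition step can be sketched schematically; the following is not the package's internal code, merely an illustration of the idea. Given row indices presorted by one predictor, each split sends the indices to the two children while preserving their sort order, in linear time:

```
# Schematic only: 'ord' holds row indices presorted by a predictor's values;
# 'goesLeft' flags, per row, whether the sample moves to the left child.
# Logical subsetting preserves order, so each child remains sorted.
restage <- function(ord, goesLeft) {
  list(left  = ord[goesLeft[ord]],
       right = ord[!goesLeft[ord]])
}
```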

Restaging is not without its problems, however. As currently implemented, a double buffer caches outcome data corresponding to every cell in the design. Hence the restaging containers can grow quite large. While some users may find this to be an acceptable price for scalability, others may find the storage requirements too high for their application. Scalability, moreover, may not be an important concern at low sample or row count.

For high-dimensional problems, the **Ranger** package
may provide a suitable alternative. Similarly, for some problems, the
package **Xgboost** offers excellent storage
characteristics. In the meantime, we envision several improvements, to
appear in upcoming versions, to the Arborist’s parsimony with
storage.