`etree` is the R package implementing Energy Trees (Giubilei et al.,
2022). The model performs classification and regression with covariates
that may be structured and of different types.

Energy Trees are used as base learners in ensemble methods that are the equivalent of bagging and Random Forests with traditional decision trees. The two resulting models are bagging of Energy Trees and Random Energy Forests, respectively.

This vignette explains how these two models can be used in R for
classification and regression tasks. Both models are implemented through
the function `eforest()`. The variable types included in the
examples are those currently available in the package: numeric, nominal,
functional, and in the form of graphs.

The first thing to do to run the examples is loading the package
`etree`.

```
library(etree)
```

We need data for our examples. We can use the same covariates for regression and classification, and then change only the response. Let us consider \(100\) observations divided into four equal-sized and disjoint groups \(G_1, G_2, G_3, G_4\). The response variable for regression is defined as:

\[Y^R_i \sim \begin{cases} \mathcal{N}(0, 1) & \text{for} \; i \in G_1\\ \mathcal{N}(2, 1) & \text{for} \; i \in G_2\\ \mathcal{N}(4, 1) & \text{for} \; i \in G_3\\ \mathcal{N}(6, 1) & \text{for} \; i \in G_4 \end{cases}.\]

The corresponding R object must be of class `"numeric"` to be
correctly recognized by `eforest()` as a response variable for
regression.

```
# Set seed
set.seed(123)

# Number of observations
n_obs <- 100

# Response variable for regression
resp_reg <- c(rnorm(n_obs / 4, mean = 0, sd = 1),
              rnorm(n_obs / 4, mean = 2, sd = 1),
              rnorm(n_obs / 4, mean = 4, sd = 1),
              rnorm(n_obs / 4, mean = 6, sd = 1))
```

The response for classification is a nominal variable measured at four different levels: \(1, 2, 3, 4\). For each group, observations are randomly assigned one of the four levels. The probability distribution is different for each group, and it gives the largest probability to the level equal to the index of the group, uniformly allocating the rest to the other three levels. For example, the response variable for the first group is the following:

\[Y^C_i \sim \begin{cases} 1 & p = 0.85 \\ 2 & p = 0.05 \\ 3 & p = 0.05 \\ 4 & p = 0.05 \end{cases}, \quad \text{for} \; i \in G_1.\]

The same holds for \(G_2, G_3, G_4\), except that the probabilities are permuted each time so that the most probable level matches the index of the group.

The response variable for classification must be defined as a
`factor` to be correctly identified by `eforest()`.

```
# Response variable for classification
resp_cls <- factor(c(sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                            prob = c(0.85, 0.05, 0.05, 0.05)),
                     sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                            prob = c(0.05, 0.85, 0.05, 0.05)),
                     sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                            prob = c(0.05, 0.05, 0.85, 0.05)),
                     sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                            prob = c(0.05, 0.05, 0.05, 0.85))))
```

For both tasks, covariates are defined as follows:

- Numeric: \[X_{1i} \sim \begin{cases} \mathcal{U}[0, 0.55] & \text{for} \; i \in G_1\\ \mathcal{U}[0.45, 1] & \text{for} \; i \in G_2 \cup G_3 \cup G_4 \end{cases};\]

- Nominal: \[X_{2i} \sim \begin{cases} \text{Bern}(0.9) & \text{for} \; i \in G_2\\ \text{Bern}(0.1) & \text{for} \; i \in G_1 \cup G_3 \cup G_4 \end{cases};\]

- Functional: \[X_{3i} \sim \begin{cases} \mathcal{GP}(1, I_{100}) & \text{for} \; i \in G_3\\ \mathcal{GP}(0, I_{100}) & \text{for} \; i \in G_1 \cup G_2 \cup G_4 \end{cases},\]

where \(\mathcal{GP}\) is a Gaussian random process over \(100\) evaluation points from \(0\) to \(1\);

- Graphs: \[X_{4i} \sim \begin{cases} \text{ER}(100, 0.9) & \text{for} \; i \in G_4\\ \text{ER}(100, 0.1) & \text{for} \; i \in G_1 \cup G_2 \cup G_3 \end{cases},\]

where \(\text{ER}(n, p)\) is an Erdős–Rényi random graph with \(n\) vertices and \(p\) as connection probability.

The structure of the generative process is such that we need at least three splits with respect to three different covariates to correctly rebuild the four groups.

The set of covariates must be generated as a list, where each element
is a different variable. Additionally, numeric variables must be numeric
vectors, nominal variables must be factors, functional variables must be
objects of class `"fdata"`, and variables in the form of graphs
must be lists of objects of class `"igraph"`.

```
# Numeric
x1 <- c(runif(n_obs / 4, min = 0, max = 0.55),
        runif(n_obs / 4 * 3, min = 0.45, max = 1))

# Nominal
x2 <- factor(c(rbinom(n_obs / 4, 1, 0.1),
               rbinom(n_obs / 4, 1, 0.9),
               rbinom(n_obs / 2, 1, 0.1)))

# Functions
x3 <- c(fda.usc::rproc2fdata(n_obs / 2, seq(0, 1, len = 100), sigma = 1),
        fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1,
                             mu = rep(1, 100)),
        fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1))

# Graphs
x4 <- c(lapply(1:(n_obs / 4 * 3), function(j) igraph::sample_gnp(100, 0.1)),
        lapply(1:(n_obs / 4), function(j) igraph::sample_gnp(100, 0.9)))

# Covariates list
cov_list <- list(X1 = x1, X2 = x2, X3 = x3, X4 = x4)
```

Function `eforest()` implements both bagging of Energy
Trees (BET) and Random Energy Forests (REF), depending on the value of
the `random_covs` argument: if it is a positive integer smaller
than the total number of covariates, the function fits a REF; if it is
`NULL` (default), or equal to the number of covariates, it fits a
BET.
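
For instance, with the response and covariates generated above, the two models can be requested explicitly as follows (a minimal sketch; `ntrees` is kept small only to limit the computational burden):

```
# Random Energy Forest: 2 of the 4 covariates are candidates at each split
ref_fit <- eforest(response = resp_reg,
                   covariates = cov_list,
                   ntrees = 12,
                   random_covs = 2)

# Bagging of Energy Trees: all covariates are candidates at each split
bet_fit <- eforest(response = resp_reg,
                   covariates = cov_list,
                   ntrees = 12,
                   random_covs = NULL)
```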

As with `etree()`, `eforest()` is specified in the same
way for regression and classification tasks, automatically
distinguishing between the two based on the type of the response.

Many arguments are inherited from `etree()`. The additional
ones are:

- `ntrees`, i.e., the number of Energy Trees to grow in the ensemble;
- `ncores`, i.e., the number of cores to use;
- `perf_metric`, i.e., the performance metric used to compute the Out-Of-Bag error;
- `random_covs`, i.e., the size of the random subset of covariates to choose from at each split;
- `verbose`, to visually keep track of the number of trees during the fitting process.

The structure of `eforest()` is such that it first generates
`ntrees` bootstrap samples and then calls `etree()` on
each of them. Then, it computes the Out-Of-Bag (OOB) score using the
performance metric defined through `perf_metric`.
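
The following conceptual sketch illustrates this procedure. It is not the actual implementation; in particular, subsetting each covariate with `[` is an assumption that happens to hold for the types used here (numeric vectors, factors, `"fdata"` objects, and lists of graphs):

```
# Conceptual sketch of what eforest() does (not the actual implementation)
sketch_eforest <- function(response, covariates, ntrees) {
  n <- length(response)
  lapply(seq_len(ntrees), function(b) {
    in_bag <- sample(n, replace = TRUE)                  # bootstrap sample
    boot_cov <- lapply(covariates, function(x) x[in_bag])
    # grow one Energy Tree on the bootstrap sample;
    # observations never drawn into 'in_bag' form this tree's OOB set,
    # on which the OOB score is computed via 'perf_metric'
    etree(response = response[in_bag], covariates = boot_cov)
  })
}
```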

The object returned by `eforest()` is of class
`"eforest"` and is composed of three elements:

- `ensemble`, i.e., a list containing all the fitted trees;
- `oob_score`, i.e., the Out-Of-Bag score;
- `perf_metric`, i.e., the performance metric used for computing the OOB score.

The first example is regression. The set of covariates is
`cov_list` and the response variable is `resp_reg`. The
most immediate way to grow an ensemble using `eforest()` is by
specifying the response and the set of covariates only. This corresponds
to using the default values for the optional arguments. Here, we add
`ntrees = 12` to reduce the computational burden and change the
performance metric using `perf_metric = 'MSE'` (the default is
`RMSPE`, for Root Mean Square Percentage Error).

```
# Quick fit
ef_fit <- eforest(response = resp_reg,
                  covariates = cov_list,
                  ntrees = 12,
                  perf_metric = 'MSE')
#> Warning: executing %dopar% sequentially: no parallel backend registered
```

By default, `eforest()` uses `minbucket = 1` and
`alpha = 1` to grow larger and less correlated trees. It employs
`"cluster"` as `split_type` because it is usually faster.
For regression, `random_covs` is set to one third (rounded down)
of the total number of covariates.
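
Spelling these defaults out, the quick fit above corresponds to an explicit call along the following lines (a sketch based on the defaults just described; the two fits coincide only up to bootstrap randomness):

```
# Explicit counterpart of the quick fit, with the regression defaults
# described above spelled out (not required in practice)
ef_fit_explicit <- eforest(response = resp_reg,
                           covariates = cov_list,
                           ntrees = 12,
                           perf_metric = 'MSE',
                           minbucket = 1,
                           alpha = 1,
                           split_type = 'cluster',
                           random_covs = floor(length(cov_list) / 3))
```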

The returned object, `ef_fit`, stores all the trees in its
element `ensemble`, which is a list. Each tree in the list can be
inspected, plotted, subsetted, and used for predictions just as any
other tree produced with `etree()` (in fact, also these trees are
produced with `etree()`!). As an example, we can plot the fourth
tree.

```
# Plot of a tree from the ensemble
plot(ef_fit$ensemble[[4]])
```

The Out-Of-Bag score for the whole ensemble is stored in
`ef_fit`’s element `oob_score`. The performance metric
used to compute the OOB score can be retrieved from element
`perf_metric`.

```
# Retrieve performance metric
ef_fit$perf_metric
#> [1] "MSE"

# OOB MSE for this ensemble
ef_fit$oob_score
#> [1] 1.289393
```

Finally, the ensemble can be used to make predictions using the
function `predict()` on `ef_fit`, and possibly specifying
a new set of covariates using `newdata`. Any other information,
such as the splitting strategy or the type of the task, is automatically
retrieved from the fitted object.

Predictions are based on the fitted values if `newdata` is not
provided, or on the new set of covariates if `newdata` is
specified.

In both cases, each tree in `ef_fit$ensemble` is used to make
predictions by calling `predict()` on it (with the same
specification of `newdata`). Then, for regression, individual
trees’ predictions for single observations are combined by arithmetic
mean.

Let’s start with the case where `newdata` is not provided.

```
# Predictions from the fitted object
pred <- predict(ef_fit)
print(pred)
#> 10 48 57 85 52 16 34 48
#> 2.679902 4.135738 3.360645 3.106418 2.978768 4.542121 2.108123 2.074140
#> 59 67 107 10 62 25 17 12
#> 3.403327 3.188509 3.822816 2.831832 2.872119 3.641068 3.316160 2.588191
#> 16 17 42 36 7 52 7 44
#> 3.208347 2.627438 2.753697 2.849644 3.053536 1.750992 3.288001 2.565320
#> 14 69 98 42 106 43 64 50
#> 3.494627 4.665498 3.289205 1.789143 2.636787 1.205594 3.731749 3.301720
#> 107 11 101 28 73 99 88 63
#> 2.392789 3.779408 3.386390 2.012728 3.844547 1.957984 3.167867 4.129345
#> 69 34 36 98 86 35 42 29
#> 3.612086 2.768118 3.676978 3.340629 3.766601 2.573598 2.680614 2.176216
#> 85 69 69 38 8 12 34 78
#> 2.571079 3.473253 2.737670 3.507288 2.696437 3.004540 3.006698 3.761995
#> 67 29 42 77 91 14 12 69
#> 4.045500 2.792889 2.320264 4.485125 2.324540 3.569736 4.016510 3.456564
#> 59 51 62 40 100 105 52 109
#> 3.580458 2.103570 2.742141 3.051180 2.831785 3.206117 3.302879 3.207901
#> 27 88 103 77 58 7 108 38
#> 2.385346 2.090764 3.679654 2.736382 2.914928 1.201987 2.023332 2.324558
#> 12 64 77 39 93 65 69 65
#> 3.760490 2.528069 3.394093 3.203227 3.022865 3.431177 3.248708 2.412103
#> 78 92 48 36 75 7 34 50
#> 3.858941 3.241836 3.920557 3.271207 2.604875 3.395582 2.321145 2.585861
#> 67 74 43 44
#> 3.805606 4.150609 2.177344 3.618990
```

However, this case is rarely of practical utility because the model is usually evaluated using the OOB score.

Let’s see an example where predictions are calculated for a new set of covariates.

```
# New set of covariates
n_obs <- 40
x1n <- c(runif(n_obs / 4, min = 0, max = 0.55),
         runif(n_obs / 4 * 3, min = 0.45, max = 1))
x2n <- factor(c(rbinom(n_obs / 4, 1, 0.1),
                rbinom(n_obs / 4, 1, 0.9),
                rbinom(n_obs / 2, 1, 0.1)))
x3n <- c(fda.usc::rproc2fdata(n_obs / 2, seq(0, 1, len = 100), sigma = 1),
         fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1,
                              mu = rep(1, 100)),
         fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1))
x4n <- c(lapply(1:(n_obs / 4 * 3), function(j) igraph::sample_gnp(100, 0.1)),
         lapply(1:(n_obs / 4), function(j) igraph::sample_gnp(100, 0.9)))
new_cov_list <- list(X1 = x1n, X2 = x2n, X3 = x3n, X4 = x4n)

# New response
new_resp_reg <- c(rnorm(n_obs / 4, mean = 0, sd = 1),
                  rnorm(n_obs / 4, mean = 2, sd = 1),
                  rnorm(n_obs / 4, mean = 4, sd = 1),
                  rnorm(n_obs / 4, mean = 6, sd = 1))

# Predictions
new_pred <- predict(ef_fit,
                    newdata = new_cov_list)
print(new_pred)
#> 28 50 42 36 38 7 7 7
#> 0.1861834 2.6370162 0.2083508 0.5578421 0.3257152 0.3634092 0.4702349 0.1322439
#> 11 42 52 12 52 11 57 11
#> 2.6203437 0.6641498 2.2038720 2.4452980 1.9911947 2.2209439 2.1411308 2.3513359
#> 57 50 52 62 100 99 92 100
#> 1.9500233 2.3971748 1.8652271 2.2018243 4.2073993 4.5920318 3.7338271 4.2410766
#> 109 105 88 88 99 99 17 17
#> 3.6221229 4.1800621 4.4216832 4.3556880 3.9694942 4.0655768 5.8279225 5.8652925
#> 16 67 78 69 77 69 16 67
#> 6.0119947 5.8300567 5.9093864 6.1107621 5.9784577 6.2000173 6.1465376 6.0543855
```
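
As noted above, for regression the ensemble prediction is the arithmetic mean of the individual trees' predictions. We can check this by hand (a sketch; it assumes each tree in `ef_fit$ensemble` accepts the same `newdata`):

```
# Average the individual trees' predictions on the new covariates
tree_preds <- sapply(ef_fit$ensemble,
                     function(tr) predict(tr, newdata = new_cov_list))
manual_pred <- rowMeans(tree_preds)

# Should match the ensemble prediction up to names
all.equal(unname(manual_pred), unname(new_pred))
```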

We can compare the mean squared error between the new response and the predicted values with the error between the new response and its average, to see that predictions are, on average, considerably closer to the actual values than this constant baseline.

```
# MSE between the new response and its average
mean((new_resp_reg - mean(new_resp_reg)) ^ 2)
#> [1] 6.864907
# MSE between the new response and predictions with the new set of covariates
mean((new_resp_reg - new_pred) ^ 2)
#> [1] 1.40344
```

The second example is classification. The set of covariates is the
same as before, i.e., `cov_list`, while the response variable is
`resp_cls`. In this case too, the most immediate fit can be
obtained by calling `eforest()` and specifying the response and
the set of covariates only. Here, we also set `ntrees = 12` to
reduce the computational burden, and `split_type = 'coeff'` to
employ the other splitting strategy.

```
# Quick fit
ef_fit <- eforest(response = resp_cls,
                  covariates = cov_list,
                  ntrees = 12,
                  split_type = 'coeff')
```

Everything else works as for regression, but for classification
tasks the default performance metric is Balanced Accuracy, and
`random_covs` is set to the square root (rounded down) of the
number of covariates.
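
With the four covariates used here, the classification default is therefore \(\lfloor \sqrt{4} \rfloor = 2\). The following sketch makes that choice explicit, which yields a Random Energy Forest:

```
# REF for classification, with 'random_covs' set explicitly to its
# classification default, floor(sqrt(p))
ref_cls_fit <- eforest(response = resp_cls,
                       covariates = cov_list,
                       ntrees = 12,
                       split_type = 'coeff',
                       random_covs = floor(sqrt(length(cov_list))))
```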

Predictions are obtained using the same commands as for regression. However, internally, individual trees’ predictions for single observations are aggregated by majority vote.

```
# Predictions from the fitted object
pred <- predict(ef_fit)
print(pred)
#> 34 18 33 7 18 5 19 21 25 18 21 14 7 26 24 14 7 24 31 7 7 10 31 35 35 26
#> 1 2 1 1 4 4 4 2 3 4 3 4 4 2 4 2 1 1 2 2 1 1 1 2 1 1
#> 7 18 14 7 7 7 31 21 5 18 25 31 31 33 34 21 21 7 14 6 7 6 21 34 21 21
#> 1 4 4 1 3 1 1 2 2 4 1 4 3 2 3 2 1 4 4 2 1 2 2 3 2 4
#> 31 7 26 31 21 18 5 21 21 7 24 21 21 21 14 7 7 18 16 18 7 16 34 18 7 7
#> 2 4 1 1 1 1 3 2 2 1 2 1 1 3 1 1 3 4 4 1 4 4 2 2 2 1
#> 29 7 14 33 17 26 18 18 6 19 18 7 18 7 24 18 7 18 34 7 21 21
#> 1 2 1 2 1 4 3 4 1 4 4 2 2 1 2 4 2 4 2 4 2 2
#> Levels: 1 2 3 4
```

To exemplify the use of `predict()` with `newdata`, we
can use the new set of covariates generated for regression, coupling it
with a new response variable for classification.

```
# New response
new_resp_cls <- factor(c(sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.85, 0.05, 0.05, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.85, 0.05, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.05, 0.85, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.05, 0.05, 0.85))))

# Predictions
new_pred <- predict(ef_fit,
                    newdata = new_cov_list)
print(new_pred)
#> 7 16 7 7 7 7 7 5 25 7 21 21 21 21 21 21 21 21 21 21 33 29 33 34 29 34
#> 1 3 1 1 1 1 1 1 2 1 2 4 2 4 2 2 2 2 2 2 2 3 3 3 2 3
#> 35 35 31 33 19 18 18 18 18 18 18 14 18 17
#> 3 3 3 2 4 4 4 4 4 4 4 4 4 4
#> Levels: 1 2 3 4
# Confusion matrix between the new response and predictions from the fitted ensemble
table(new_pred, new_resp_cls, dnn = c('Predicted', 'True'))
#> True
#> Predicted 1 2 3 4
#> 1 8 0 0 0
#> 2 2 7 2 1
#> 3 1 0 6 1
#> 4 0 2 0 10
# Misclassification error for predictions on the new set of covariates
sum(new_pred != new_resp_cls) / length(new_resp_cls)
#> [1] 0.225
```
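
To see the majority voting rule at work, we can recompute the predictions by hand (a sketch; it assumes each tree in `ef_fit$ensemble` accepts the same `newdata`, and note that `which.max()` breaks ties by picking the first level):

```
# Collect each tree's vote for every new observation
tree_votes <- sapply(ef_fit$ensemble,
                     function(tr) as.character(predict(tr, newdata = new_cov_list)))

# Majority vote across trees, row by row
majority <- apply(tree_votes, 1,
                  function(votes) names(which.max(table(votes))))
head(majority)
```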

In this case, we obtain a misclassification error of 0.225 on the test data, which should, however, be taken with a grain of salt given the very small number of trees used in this toy example.

R. Giubilei, T. Padellini, P. Brutti (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf.