FeatureTerminatoR

Loading the packages

To load the packages, use the commands below:

library(FeatureTerminatoR)
library(caret)
library(dplyr)
library(ggplot2)
library(randomForest)

Recursive Feature Elimination

The trick here is to use cross-validation, or repeated cross-validation, to eliminate n features from the model. This is achieved by fitting the model multiple times at each step and removing the weakest features, determined either by the model's coefficients or by its feature importance attributes.

Within the package there are a number of selection function sets you can utilise, passed to the eval_funcs argument and drawn from caret:

lmFuncs - linear model variable importance
rfFuncs - random forest variable importance
nbFuncs - naive Bayes variable importance
treebagFuncs - bagged tree variable importance
caretFuncs - generic functions for any caret model

See the underlying caretFuncs() documentation.

The model implements all these methods. I will utilise the random forest variable importance selection method (rfFuncs), as this is quick to train on our test dataset.
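As a rough sketch of what this method does under the hood, the equivalent direct caret call pairs rfe() with an rfeControl configured to use rfFuncs (assuming the 10-fold cross-validation shown in the results later in this document):

```r
library(caret)
library(randomForest)

set.seed(123)

# Random forest importance scoring with 10-fold cross-validation
ctrl <- rfeControl(functions = rfFuncs,
                   method = "cv",
                   number = 10)

# Recursively eliminate features from the four iris predictors
rfe_direct <- rfe(x = iris[, 1:4],
                  y = iris[, 5],
                  sizes = 1:4,          # candidate subset sizes to evaluate
                  rfeControl = ctrl)

# Inspect the optimal variable subset
predictors(rfe_direct)
```

On iris this typically selects Petal.Length and Petal.Width, in line with the rfeTerminator output below.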

Using the rfe_removeR function in FeatureTerminatoR

The following steps will take you through how to use this function.

Loading the test data

For the test data we will use the built-in iris dataset.

df <- iris
print(head(df,10))
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa

Fitting a RFE method to the data

Now it is time to use the workhorse function for the RFE (Recursive Feature Elimination) methods:

#Passing in the indexes as slices: x values in columns 1:4 and the y value in column 5
rfe_fit <- rfeTerminator(df, x_cols= 1:4, y_cols=5, alter_df = TRUE, eval_funcs = rfFuncs)
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Length
#> [IVS SELECTED] Optimal variables are: Petal.Width
#Passing by column name
rfe_fit_col_name <- rfeTerminator(df, x_cols=1:4, y_cols="Species", alter_df=TRUE)
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Width
#> [IVS SELECTED] Optimal variables are: Petal.Length
# A further example
ref_x_col_name <- rfeTerminator(df,
                                x_cols=c("Sepal.Length", "Sepal.Width",
                                        "Petal.Length", "Petal.Width"),
                                y_cols = "Species")
#> [INFO] Removing features as a result of recursive feature enginnering. Expose rfe_reduced_data from returned list using $ selectors.
#> [IVS SELECTED] Optimal variables are: Petal.Length
#> [IVS SELECTED] Optimal variables are: Petal.Width

This shows that it does not matter how you pass the data to the function; however, when passing x column names they need to be wrapped in a vector, as the further example highlights. Otherwise, you can simply pass the columns as a numeric slice of the data frame.

Exploring the model output results

The model will select the best combination of variables, with the sizes argument indicating the feature subset sizes to evaluate. This defaults to 1:10.
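For example, to narrow the search to smaller subsets, sizes can be set explicitly (a sketch, assuming sizes is forwarded to the underlying caret rfe call as the text describes):

```r
library(FeatureTerminatoR)
library(caret)

df <- iris

# Evaluate only subsets of one to three features
rfe_fit_sized <- rfeTerminator(df, x_cols = 1:4, y_cols = 5,
                               sizes = 1:3,
                               alter_df = TRUE,
                               eval_funcs = rfFuncs)
```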

#Explore the optimal model results
print(rfe_fit$rfe_model_fit_results)
#> 
#> Recursive feature selection
#> 
#> Outer resampling method: Cross-Validated (10 fold) 
#> 
#> Resampling performance over subset size:
#> 
#>  Variables Accuracy Kappa AccuracySD KappaSD Selected
#>          1   0.9067  0.86    0.05622 0.08433         
#>          2   0.9600  0.94    0.06441 0.09661        *
#>          3   0.9467  0.92    0.06126 0.09189         
#>          4   0.9533  0.93    0.06325 0.09487         
#> 
#> The top 2 variables (out of 2):
#>    Petal.Length, Petal.Width
#View the optimum variables selected
print(rfe_fit$rfe_model_fit_results$optVariables)
#> [1] "Petal.Length" "Petal.Width"

Outputting the original and reduced data

The returned list retains the original data, with the alter_df argument indicating whether the results should be output for manual evaluation of the backward elimination, or whether the data frame should be reduced automatically. The input could be the full data before a training/testing split, or the training set alone, depending on your ML pipeline strategy.
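As a sketch, setting alter_df = FALSE should leave the frame untouched so the backward elimination can be reviewed manually before any columns are dropped:

```r
library(FeatureTerminatoR)
library(caret)

df <- iris

# Fit the RFE method without reducing the data frame
rfe_review <- rfeTerminator(df, x_cols = 1:4, y_cols = 5,
                            alter_df = FALSE,
                            eval_funcs = rfFuncs)

# Review the cross-validated fit before deciding what to drop
print(rfe_review$rfe_model_fit_results)
```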

Viewing the original data

To view the original data:

#Explore the original data passed to the frame
print(head(rfe_fit$rfe_original_data))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

Obtaining the data after rfe termination

Viewing the outputs post termination, you can observe that the features that have little bearing on the dependent (predicted) variable have been terminated:

#Explore the data adapted with the less important features removed
print(head(rfe_fit$rfe_reduced_data))
#>   Petal.Length Petal.Width Species
#> 1          1.4         0.2  setosa
#> 2          1.4         0.2  setosa
#> 3          1.3         0.2  setosa
#> 4          1.5         0.2  setosa
#> 5          1.4         0.2  setosa
#> 6          1.7         0.4  setosa

The features that do not have a significant impact have been removed from your model, which should speed up training of the ML or predictive model.

Next, we move on to another feature selection method; this time we utilise a correlation method to remove the potential effects of multicollinearity.

Removing Highly Correlated Features - multicol_terminatoR

The main reason you would want to do this is to avoid multicollinearity. This effect arises when there are high intercorrelations among two or more independent variables in linear models. It is less of a problem for non-linear models, such as trees, but can still cause high variance in the models, so scaling of independent variables is always recommended.

Why worry about multicollinearity?

In general, multicollinearity leads to wider confidence intervals and less reliable estimates of the effect of individual independent variables in a model. That is, the statistical inferences from a model with multicollinearity may not be dependable.
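A quick base-R illustration of this effect on iris: the standard error of Petal.Length's coefficient inflates when its highly correlated partner, Petal.Width, enters the same linear model.

```r
# Petal.Length and Petal.Width are strongly correlated (~0.96)
cor(iris$Petal.Length, iris$Petal.Width)

# Model with both collinear predictors
fit_both <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
# Model with only one of the pair
fit_one  <- lm(Sepal.Length ~ Petal.Length, data = iris)

se_both <- summary(fit_both)$coefficients["Petal.Length", "Std. Error"]
se_one  <- summary(fit_one)$coefficients["Petal.Length", "Std. Error"]

# The coefficient estimate is far less precise in the collinear model
c(with_partner = se_both, alone = se_one)
```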

This is why you would want to remove highly correlated features before fitting a model.

Getting started with the high correlation removal

We already have our test data loaded, and we will reuse the iris data frame from the previous example.

#Fit a model on the results and define a confidence cut off limit
mc_term_fit <- FeatureTerminatoR::mutlicol_terminator(df, x_cols=1:4,
                                   y_cols="Species",
                                   alter_df=TRUE,
                                   cor_sig = 0.90)
#> [INFO] Removing features as a result of highly correlated value cut off.
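For comparison, caret's findCorrelation() applies a similar pairwise cut-off based removal. On iris with cutoff = 0.90 it should flag Petal.Length, matching the $feature_removed_df output shown later in this section:

```r
library(caret)

# Pairwise correlations across the four numeric predictors
cor_mat <- cor(iris[, 1:4])

# Column names whose correlations exceed the 0.90 cut-off
to_drop <- findCorrelation(cor_mat, cutoff = 0.90, names = TRUE)
print(to_drop)

# Drop the flagged columns manually
reduced <- iris[, !(names(iris) %in% to_drop)]
head(reduced)
```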

Visualising the outputs

Exploring the outputs:

# Visualise the quantile distributions of where the correlations lie
mc_term_fit$corr_quant_chart

This shows that our cut-off range starts at about the 85th percentile of the correlation distribution, at the top end. This would also work for strong negative associations. Here, we could probably be a little stricter than our 0.90 limit, but we will keep it for now, as we do not want to purge all the features.

Viewing the raw correlation and covariance matrices

This has been built into the tool for ease:

# View the correlation matrix
mc_term_fit$corr_matrix
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
#> Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
#> Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
#> Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
# View the covariance matrix
mc_term_fit$cov_matrix
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
#> Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
#> Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
#> Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
# View the quantile range
mc_term_fit$corr_quantile #This excludes the diagonal correlations, as this would inflate the quantile distribution
#>         5%        10%        15%        20%        25%        30%        35% 
#> -0.4284401 -0.4222087 -0.3879359 -0.3661259 -0.3661259 -0.2915591 -0.1548532 
#>        40%        45%        50%        55%        60%        65%        70% 
#> -0.1175698 -0.1175698  0.3501857  0.8179411  0.8179411  0.8260130  0.8556100 
#>        75%        80%        85%        90%        95%       100% 
#>  0.8717538  0.8717538  0.9036429  0.9537543  0.9628654  0.9628654

There are strong correlations between petal length and petal width, so these will be clipped by our choice of cut-off.

Viewing the reduced data

To get the outputs from the feature selection method, we use the following call to obtain the output tibble:

# Get the removed and reduced data
new_df_post_feature_removal <- mc_term_fit$feature_removed_df
glimpse(new_df_post_feature_removal)
#> Rows: 150
#> Columns: 4
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
#> $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
#> $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
#> $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~

Here, the algorithm has removed Petal.Length based on the correlation cut-off limit provided.

Still to be included

These algorithms form the first version of the package; further selection methods are still to be developed.