2022-04-24

The idea of the `mlapi` package is to provide guidelines on how to implement interfaces of machine learning models in order to have a unified, consistent workflow. The API design is mainly borrowed from the very successful Python `scikit-learn` package. At the moment the scope is limited to the following **base classes**:

- `mlapiEstimation` / `mlapiEstimationOnline` - models which implement supervised learning: **regression** or **classification**
- `mlapiTransformation` / `mlapiTransformationOnline` - models which learn **transformations** of the data. For example, a model can learn TF-IDF on some matrix and apply it to another holdout matrix
- `mlapiDecomposition` / `mlapiDecompositionOnline` - models which **decompose** an input matrix into two matrices (usually low rank). A good example is matrix factorization, where the input matrix \(X\) is decomposed into two matrices \(P\) and \(Q\) so that \(X \approx P Q\).

All the base classes above ask the developer to implement a set of
methods and expose a set of members. The developer should provide an
implementation of the class which **inherits from the corresponding base
class** above.

There are several **agreements** which help to maintain a
consistent workflow.

- In contrast to most R packages, `mlapi` defines *models to be mutable*, internally implemented as `R6` classes.
- Model creation is a declarative process where all the model parameters should be passed to the constructor. Model creation is separate from model fitting: `model = SomeModel$new(param_1 = 1, param_2 = 10)`.
- Depending on the base class, models should implement the following methods for model training:
  - `fit` - `mlapiEstimation`
  - `fit_transform` - `mlapiTransformation`, `mlapiDecomposition`
  - `partial_fit` - `mlapiEstimationOnline`, `mlapiTransformationOnline`, `mlapiDecompositionOnline`
- Depending on the base class, models should implement the following methods for model transformations/predictions:
  - `predict` - `mlapiEstimation`, `mlapiEstimationOnline`
  - `transform` - `mlapiTransformation`, `mlapiTransformationOnline`, `mlapiDecomposition`, `mlapiDecompositionOnline`
- After `mlapiDecomposition` / `mlapiDecompositionOnline` model fitting, the field `private$components_` should be initialized (mind the underscore at the end!). It should contain the **matrix** \(Q\) (as per \(X \approx P Q\)).
- All the methods above should **work only with matrices**, dense or sparse. Dense matrices usually come from the `base` package and sparse matrices from the `Matrix` package.
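As an illustration of these agreements, here is a minimal sketch of a transformer that divides each column by its standard deviation. The class name `Scaler` and the private field `std_dev` are hypothetical and chosen only for this example; the sketch assumes the `R6` and `mlapi` packages are installed and relies only on the base-class helpers `set_internal_matrix_formats()` and `check_convert_input()` that are used elsewhere in this document.

```
# Hypothetical transformer: scales each column by its standard deviation.
Scaler = R6::R6Class(
  classname = "Scaler",
  inherit = mlapi::mlapiTransformation,
  public = list(
    # all parameters go to the constructor; no data is involved here
    initialize = function() {
      super$set_internal_matrix_formats(dense = "matrix", sparse = NULL)
    },
    # learn the column standard deviations and return the scaled input
    fit_transform = function(x, ...) {
      x = super$check_convert_input(x)
      private$std_dev = apply(x, 2, sd)
      self$transform(x)
    },
    # apply the previously learned statistics to new data
    transform = function(x, ...) {
      stopifnot(!is.null(private$std_dev))
      stopifnot(ncol(x) == length(private$std_dev))
      sweep(x, 2, private$std_dev, "/")
    }
  ),
  private = list(
    std_dev = NULL
  )
)

set.seed(1)
train = matrix(rnorm(50 * 3, sd = 5), ncol = 3)
test = matrix(rnorm(20 * 3, sd = 5), ncol = 3)

scaler = Scaler$new()                       # model creation
train_scaled = scaler$fit_transform(train)  # fitting mutates `scaler` in place
test_scaled = scaler$transform(test)        # reuses statistics learned on `train`
```

Because the model is mutable, `scaler` itself carries the learned state after `fit_transform()`; nothing has to be reassigned before calling `transform()` on the holdout matrix.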

These agreements allow us to create concise pipelines which are easy to train and apply to new data (details in the next section):

```
# transformer:
# scaler just divides each column by its standard deviation
scaler = Scaler$new()

# decomposition:
# fits truncated SVD: X = U * S * V
# or rephrasing X = P * Q where P = U * sqrt(S); Q = sqrt(S) * V
# as a result trunc_svd$fit_transform(train) returns matrix P and learns matrix Q (stored inside the model)
# when trunc_svd$transform(test) is called, the model uses matrix Q in order to find matrix P for `test` data
trunc_svd = SVD$new(rank = 16)

# estimator:
# fits L1/L2 regularized logistic regression
logreg = LogisticRegression$new(L1 = 0.1, L2 = 10)
```

```
train %>%
  fit_transform(scaler) %>%
  fit_transform(trunc_svd) %>%
  fit(logreg)
```

Now all models are fitted.

```
predictions = test %>%
  transform(scaler) %>%
  transform(trunc_svd) %>%
  predict(logreg)
```
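The pipelines above call `fit_transform`, `transform` and `predict` as data-first functions rather than as methods on the model. A minimal sketch of how such pipe-friendly wrappers could look is shown here in base R only. The names `fit_transform_sketch`, `transform_sketch` and `make_centerer` are hypothetical, the toy model is built from closures as a stand-in for an `R6` object, and none of this is `mlapi`'s actual source. It uses the native pipe `|>`, so it assumes R 4.1 or newer.

```
# Toy mutable model (stand-in for an R6 object): learns column means.
make_centerer = function() {
  center = NULL
  list(
    fit_transform = function(x, ...) {
      center <<- colMeans(x)   # state is stored inside the model
      sweep(x, 2, center, "-")
    },
    transform = function(x, ...) {
      stopifnot(!is.null(center))
      sweep(x, 2, center, "-")
    }
  )
}

# Hypothetical data-first wrappers: data is the first argument,
# so a pipe can feed the result of one step into the next.
fit_transform_sketch = function(x, model, ...) model$fit_transform(x, ...)
transform_sketch = function(x, model, ...) model$transform(x, ...)

set.seed(1)
train = matrix(rnorm(20), ncol = 2)
test = matrix(rnorm(10), ncol = 2)

centerer = make_centerer()
train_centered = train |> fit_transform_sketch(centerer)
test_centered = test |> transform_sketch(centerer)
```

Putting the data first and the model second is what lets each step return a matrix that flows into the next step, while the fitted state stays inside the model objects.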

```
SimpleLinearModel = R6::R6Class(
  classname = "mlapiSimpleLinearModel",
  inherit = mlapi::mlapiEstimation,
  public = list(
    initialize = function(tol = 1e-7) {
      private$tol = tol
      super$set_internal_matrix_formats(dense = "matrix", sparse = NULL)
    },
    fit = function(x, y, ...) {
      x = super$check_convert_input(x)
      stopifnot(is.vector(y))
      stopifnot(is.numeric(y))
      stopifnot(nrow(x) == length(y))

      private$n_features = ncol(x)
      private$coefficients = .lm.fit(x, y, tol = private$tol)[["coefficients"]]
    },
    predict = function(x) {
      stopifnot(ncol(x) == private$n_features)
      x %*% matrix(private$coefficients, ncol = 1)
    }
  ),
  private = list(
    tol = NULL,
    coefficients = NULL,
    n_features = NULL
  )
)
```

```
set.seed(1)
model = SimpleLinearModel$new()
x = matrix(sample(100 * 10, replace = TRUE), ncol = 10)
y = sample(c(0, 1), 100, replace = TRUE)
model$fit(as.data.frame(x), y)
res1 = model$predict(x)
# check pipe-compatible S3 interface
res2 = predict(x, model)
identical(res1, res2)
```

`## [1] TRUE`

```
TruncatedSVD = R6::R6Class(
  classname = "TruncatedSVD",
  inherit = mlapi::mlapiDecomposition,
  public = list(
    initialize = function(rank = 10) {
      private$rank = rank
      super$set_internal_matrix_formats(dense = "matrix", sparse = NULL)
    },
    fit_transform = function(x, ...) {
      x = super$check_convert_input(x)
      private$n_features = ncol(x)
      svd_fit = svd(x, nu = private$rank, nv = private$rank, ...)
      sing_values = svd_fit$d[seq_len(private$rank)]
      # P = U * sqrt(S) is returned, Q = sqrt(S) * t(V) is stored in the model
      result = svd_fit$u %*% diag(x = sqrt(sing_values))
      private$components_ = t(svd_fit$v %*% diag(x = sqrt(sing_values)))
      rm(svd_fit)
      rownames(result) = rownames(x)
      colnames(private$components_) = colnames(x)
      private$fitted = TRUE
      invisible(result)
    },
    transform = function(x, ...) {
      if (private$fitted) {
        stopifnot(ncol(x) == ncol(private$components_))
        lhs = tcrossprod(private$components_)
        rhs = as.matrix(tcrossprod(private$components_, x))
        t(solve(lhs, rhs))
      }
      else
        stop("Fit the model first with model$fit_transform()!")
    }
  ),
  private = list(
    rank = NULL,
    n_features = NULL,
    # FALSE rather than NULL so that `if (private$fitted)` works before fitting
    fitted = FALSE
  )
)
```

```
set.seed(1)
model = TruncatedSVD$new(2)
x = matrix(sample(100 * 10, replace = TRUE), ncol = 10)
x_trunc = model$fit_transform(x)
dim(x_trunc)
```

`## [1] 100 2`

```
x_trunc_2 = model$transform(x)
sum(x_trunc_2 - x_trunc)
```

`## [1] -9.428555e-12`
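The difference above is close to zero for a reason that is worth spelling out. With \(Q\) fixed (the matrix stored in `private$components_`), `transform` looks for the \(P\) that minimizes the squared reconstruction error. A short derivation, using only the notation \(X \approx P Q\) from the introduction:

\[
\hat{P} = \arg\min_{P} \lVert X - P Q \rVert_F^2
\quad\Longrightarrow\quad
\hat{P}\, Q Q^\top = X Q^\top
\quad\Longleftrightarrow\quad
(Q Q^\top)\, \hat{P}^\top = Q X^\top
\]

In the code, `lhs = tcrossprod(private$components_)` is \(Q Q^\top\), `rhs = tcrossprod(private$components_, x)` is \(Q X^\top\), and `t(solve(lhs, rhs))` returns \(\hat{P}\). On the training matrix this recovers the \(P = U \sqrt{S}\) returned by `fit_transform` up to floating-point error, assuming the retained singular values are nonzero so that \(Q Q^\top\) is invertible.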

```
# check pipe-compatible S3 interface
x_trunc_2_s3 = transform(x, model)
identical(x_trunc_2, x_trunc_2_s3)
```

`## [1] TRUE`