# A Short Introduction to the MAKL Package

To illustrate how the MAKL library works, this document builds a simple example. We first create a synthetic dataset of 1000 rows and 6 features, drawn from a standard Gaussian distribution.

library(MAKL)
set.seed(64327) #midas
df <- matrix(rnorm(6000, 0, 1), nrow = 1000)
colnames(df) <- c("F1", "F2", "F3", "F4", "F5", "F6")

For the membership argument of makl_train(), we prepare a list of two groups: the first contains the features F1, F5 and F6, and the second contains the rest. Note that the column names of the input dataset must be a superset of the union of all feature names in the groups list.

# the names in each group must match entries of colnames(df)
groups <- list()
groups[[1]] <- c("F1", "F5", "F6")
groups[[2]] <- c("F2", "F3", "F4")
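The superset requirement can be checked before training. The following is a minimal base R sketch; the helper name check_membership and the stand-in objects X_demo and groups_demo are hypothetical and not part of MAKL.

```r
# Hypothetical helper (not part of MAKL): verify that every feature named
# in the membership list exists as a column of the input matrix.
check_membership <- function(X, membership) {
  absent <- setdiff(unlist(membership), colnames(X))
  if (length(absent) > 0) {
    stop("Features missing from colnames(X): ",
         paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

# Small stand-in data with the same column names as in the example
X_demo <- matrix(0, nrow = 2, ncol = 6,
                 dimnames = list(NULL, paste0("F", 1:6)))
groups_demo <- list(c("F1", "F5", "F6"), c("F2", "F3", "F4"))

check_membership(X_demo, groups_demo)  # passes silently
```

A group that names a column not present in the matrix, such as "F9", would make the helper stop with an error listing the missing names.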

We then create the response vector y so that it depends on the second, third and fourth features, namely F2, F3 and F4: if the sum of a data instance's entries in these three columns is positive, its response is +1; otherwise, it is -1.

# label each row by the sign of F2 + F3 + F4
y <- c()
for (i in 1:nrow(df)) {
  if ((df[i, 2] + df[i, 3] + df[i, 4]) > 0) {
    y[i] <- +1
  } else {
    y[i] <- -1
  }
}
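The same labelling rule can also be written without an explicit loop. The sketch below, in base R only, builds a small stand-in matrix (X_demo, with an arbitrary seed chosen just for this illustration) and compares the loop with a vectorized version based on rowSums() and ifelse().

```r
set.seed(1)  # arbitrary seed, used only for this illustration
X_demo <- matrix(rnorm(60), nrow = 10,
                 dimnames = list(NULL, paste0("F", 1:6)))

# Loop version, following the rule described in the text
y_loop <- c()
for (i in 1:nrow(X_demo)) {
  if ((X_demo[i, 2] + X_demo[i, 3] + X_demo[i, 4]) > 0) {
    y_loop[i] <- +1
  } else {
    y_loop[i] <- -1
  }
}

# Vectorized version: sign of the row sums over F2, F3 and F4
y_vec <- ifelse(rowSums(X_demo[, c("F2", "F3", "F4")]) > 0, 1, -1)

# Compare the two label vectors element by element
all(y_loop == y_vec)
```

Selecting the columns by name rather than by position also keeps the rule tied to the feature names used in the groups list.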

We use the synthetic dataset df and the response vector y as the training data and training response in makl_train(). We set the number of random features D to 2, a reasonable choice given that the training data is only 6-dimensional. We set sigma_N, the number of rows used for the distance matrix calculation, to 1000 (here, all rows), and use lambda_set = c(0.9, 0.8, 0.7, 0.6) to obtain sparse solutions. For the membership argument, we pass the groups list created above.

makl_model <- makl_train(X = df, y = y, D = 2, sigma_N = 1000, CV = 1, membership = groups, lambda_set = c(0.9, 0.8, 0.7, 0.6))
#> Lambda: 155.0901  nr.var: 5
#> Lambda: 137.8579  nr.var: 5
#> Lambda: 120.6257  nr.var: 5
#> Lambda: 103.3934  nr.var: 5

When we check the coefficients of the model, we see that makl_train() selected only the kernel of the second group for prediction: the first four rows are zero for every lambda, and the non-zero coefficients appear in the remaining rows. This is the expected result, since we created the response vector y to depend only on the members of the second group in the groups list.

makl_model$model$coefficients
#>       155.090126229481 137.857889981761 120.625653734041 103.39341748632
#>  [1,]       0.00000000        0.0000000        0.0000000       0.0000000
#>  [2,]       0.00000000        0.0000000        0.0000000       0.0000000
#>  [3,]       0.00000000        0.0000000        0.0000000       0.0000000
#>  [4,]       0.00000000        0.0000000        0.0000000       0.0000000
#>  [5,]      -0.29314353       -0.5938544       -0.9106226      -1.2539243
#>  [6,]       0.06703617        0.1352210        0.2057486       0.2799665
#>  [7,]       0.24539658        0.4973664        0.7630398       1.0509792
#>  [8,]      -0.36108294       -0.7320709       -1.1246002      -1.5535840
#>  [9,]       0.12450233        0.1542956        0.1858601       0.2195980
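The sparsity pattern can also be read off programmatically. The sketch below copies the first column of the printed output into a plain vector (coefs_first, a name introduced here for illustration) and lists the non-zero rows.

```r
# Coefficients copied from the first column of the printed output above
coefs_first <- c(0, 0, 0, 0,
                 -0.29314353, 0.06703617, 0.24539658, -0.36108294,
                 0.12450233)

# Indices of the rows with non-zero coefficients
nonzero_rows <- which(coefs_first != 0)
nonzero_rows          # 5 6 7 8 9
length(nonzero_rows)  # 5
```

The count of five agrees with the nr.var: 5 reported in the training log, assuming nr.var counts the non-zero coefficients. On the fitted object, the same check could be applied to each column of makl_model$model$coefficients.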

Now, let us create a synthetic test dataset df_test and a test response vector y_test, and pass them to makl_test() to check the results.

df_test <- matrix(rnorm(600, 0, 1), nrow = 100)
colnames(df_test) <- c("F1", "F2", "F3", "F4", "F5", "F6")
# label each test row by the sign of F2 + F3 + F4, as for the training data
y_test <- c()
for (i in 1:nrow(df_test)) {
  if ((df_test[i, 2] + df_test[i, 3] + df_test[i, 4]) > 0) {
    y_test[i] <- +1
  } else {
    y_test[i] <- -1
  }
}
result <- makl_test(X = df_test, y = y_test, makl_model = makl_model)

The list result contains two elements: (1) the predictions for the test response vector y_test, and (2) a table of the area under the ROC curve (AUROC) against the number of selected kernels. If CV is not applied, the table has one row for each element of lambda_set; if CV is applied, it has a single row for the best lambda in lambda_set.

result$auroc_kernel_number
#>     auroc_array n_selected_kernels
#> 0.9   0.9494179                  1
#> 0.8   0.9494179                  1
#> 0.7   0.9498193                  1
#> 0.6   0.9498193                  1
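For readers unfamiliar with the metric, the AUROC can be computed from ranks (the Mann-Whitney formulation): it is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The function below is an illustrative base R sketch with a hypothetical name, auroc_rank; it is not taken from MAKL and may differ from how makl_test() computes the value.

```r
# Hypothetical helper (not part of MAKL): rank-based AUROC for labels in {-1, +1}
auroc_rank <- function(scores, labels) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # ties receive average ranks
  # Mann-Whitney U statistic for the positives, scaled to [0, 1]
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
toy_scores <- c(0.1, 0.4, 0.35, 0.8)
toy_labels <- c(-1, -1, 1, 1)
auroc_rank(toy_scores, toy_labels)  # 0.75
```

A value of 1 means every positive instance is scored above every negative one, and 0.5 corresponds to random ordering, so the values of about 0.95 in the table indicate that the selected kernel separates the two classes well.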