`L0Learn`

is a fast toolkit for L0-regularized learning.
L0 regularization selects the best subset of features and can outperform
commonly used feature selection methods (e.g., L1 and MCP) under many
sparse learning regimes. The toolkit can (approximately) solve the
following three problems \[
\min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0+ \langle x_i,
\beta \rangle) + \lambda ||\beta||_0 \quad \quad (L0)
\] \[
\min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0+ \langle x_i,
\beta \rangle) + \lambda ||\beta||_0 + \gamma||\beta||_1 \quad (L0L1)
\]

\[ \min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0+ \langle x_i, \beta \rangle) + \lambda ||\beta||_0 + \gamma||\beta||_2^2 \quad (L0L2) \] where \(\ell\) is the loss function, \(\beta_0\) is the intercept, \(\beta\) is the vector of coefficients, and \(||\beta||_0\) denotes the L0 norm of \(\beta\), i.e., the number of non-zeros in \(\beta\). We support both regression and classification using either one of the following loss functions:

- Squared error loss
- Logistic loss (logistic regression)
- Squared hinge loss (smoothed version of SVM).

The parameter \(\lambda\) controls the strength of the L0 regularization (larger \(\lambda\) leads to less non-zeros). The parameter \(\gamma\) controls the strength of the shrinkage component (which is the L1 norm in case of L0L1 or squared L2 norm in case of L0L2); adding a shrinkage term to L0 can be very effective in avoiding overfitting and typically leads to better predictive models. The fitting is done over a grid of \(\lambda\) and \(\gamma\) values to generate a regularization path.

The algorithms provided in `L0Learn`

are based on cyclic
coordinate descent and local combinatorial search. Many computational
tricks and heuristics are used to speed up the algorithms and improve
the solution quality. These heuristics include warm starts, active set
convergence, correlation screening, greedy cycling order, and efficient
methods for updating the residuals through exploiting sparsity and
problem dimensions. Moreover, we employed a new computationally
efficient method for dynamically selecting the regularization parameter
\(\lambda\) in the path. For more
details on the algorithms used, please refer to our paper Fast
Best Subset Selection: Coordinate Descent and Local Combinatorial
Optimization Algorithms.

The toolkit is implemented in C++ along with an easy-to-use R interface. In this vignette, we provide a tutorial on using the R interface. Particularly, we will demonstrate how use L0Learnâ€™s main functions for fitting models, cross-validation, and visualization.

L0Learn can be installed directly from CRAN by executing:

If you face installation issues, please refer to the Installation Troubleshooting Wiki. If the issue is not resolved, you can submit an issue on L0Learnâ€™s Github Repo.

To demonstrate how `L0Learn`

works, we will first generate
a synthetic dataset and then proceed to fitting L0-regularized models.
The synthetic dataset (y,X) will be generated from a sparse linear model
as follows:

- X is a 500x1000 design matrix with iid standard normal entries
- B is a 1000x1 vector with the first 10 entries set to 1 and the rest are zeros.
- e is a 500x1 vector with iid standard normal entries
- y is a 500x1 response vector such that y = XB + e

This dataset can be generated in R as follows:

```
set.seed(1) # fix the seed to get a reproducible result
X = matrix(rnorm(500*1000),nrow=500,ncol=1000)
B = c(rep(1,10),rep(0,990))
e = rnorm(500)
y = X%*%B + e
```

We will use L0Learn to estimate B from the data (y,X). First we load L0Learn:

We will start by fitting a simple L0 model and then proceed to the case of L0L2 and L0L1.

To fit a path of solutions for the L0-regularized model with at most
20 non-zeros using coordinate descent (CD), we use the
`L0Learn.fit`

function as follows:

This will generate solutions for a sequence of \(\lambda\) values (chosen automatically by
the algorithm). To view the sequence of \(\lambda\) along with the associated support
sizes (i.e., the number of non-zeros), we use the `print`

method as follows:

```
#> lambda gamma suppSize
#> 1 0.068285500 0 0
#> 2 0.067602600 0 1
#> 3 0.055200200 0 2
#> 4 0.049032300 0 3
#> 5 0.040072500 0 6
#> 6 0.038602900 0 7
#> 7 0.037264900 0 8
#> 8 0.032514200 0 10
#> 9 0.001142920 0 11
#> 10 0.000821227 0 13
#> 11 0.000702292 0 14
#> 12 0.000669520 0 15
#> 13 0.000489938 0 17
```

To extract the estimated B for particular values of \(\lambda\) and \(\gamma\), we use the function
`coef(fit,lambda,gamma)`

. For example, the solution at \(\lambda = 0.0325142\) (which corresponds to
a support size of 10) can be extracted using

```
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>
#> Intercept 0.01052346
#> V1 1.01601401
#> V2 1.01829953
#> V3 1.00606260
#> V4 0.98308999
#> V5 0.97389545
#> V6 0.96148014
#> V7 1.00990709
#> V8 1.08535033
#> V9 1.02685765
#> V10 0.94236638
#> V11 .
#> V12 .
...
```

The output is a sparse vector of type `dgCMatrix`

. The
first element in the vector is the intercept and the rest are the B
coefficients. Aside from the intercept, the only non-zeros in the above
solution are coordinates 1, 2, 3, â€¦, 10, which are the non-zero
coordinates in the true support (used to generated the data). Thus, this
solution successfully recovers the true support. Note that on some BLAS
implementations, the `lambda`

value we used above (i.e.,
`0.0325142`

) might be slightly different due to the
limitations of numerical precision. Moreover, all the solutions in the
regularization path can be extracted at once by calling
`coef(fit)`

.

The sequence of \(\lambda\)
generated by `L0Learn`

is stored in the object
`fit`

. Specifically, `fit$lambda`

is a list, where
each element of the list is a sequence of \(\lambda\) values corresponding to a single
value of \(\gamma\). Since L0 has only
one value of \(\gamma\) (i.e., 0), we
can access the sequence of \(\lambda\)
values using `fit$lambda[[1]]`

. Thus, \(\lambda=0.0325142\) we used previously can
be accessed using `fit$lambda[[1]][7]`

(since it is the 7th
value in the output of `print`

). So the previous solution can
also be extracted using
`coef(fit,lambda=fit$lambda[[1]][7], gamma=0)`

.

We can make predictions using a specific solution in the grid using
the function `predict(fit,newx,lambda,gamma)`

where newx is a
testing sample (vector or matrix). For example, to predict the response
for the samples in the data matrix X using the solution with \(\lambda=0.0325142\), we call the prediction
function as follows:

```
#> 500 x 1 Matrix of class "dgeMatrix"
#> [,1]
#> [1,] 0.44584683
#> [2,] -1.52213221
#> [3,] -1.11755544
#> [4,] -0.93180249
#> [5,] -4.02095821
#> [6,] 2.02300763
#> [7,] 2.03371819
#> [8,] 0.92234198
#> [9,] 2.33359737
#> [10,] 0.92194909
#> [11,] -4.33165265
#> [12,] -2.13282518
#> [13,] -7.21962511
...
```

We can also visualize the regularization path by plotting the
coefficients of the estimated B versus the support size (i.e., the
number of non-zeros) using the `plot(fit,gamma)`

method as
follows: