robustlm: Robust Variable Selection with Exponential Squared Loss

Borui Tang, Jin Zhu

The goal of the robustlm package is to carry out robust variable selection through exponential squared loss (Wang et al. 2013). Specifically, it solves the following optimization problem:

\[ \arg\min_{\beta} \sum_{i=1}^n(1-\exp\{-(y_i-x_i^T\beta)^2/\gamma_n\})+n\sum_{i=1}^d \lambda_{n j}|\beta_{j}|, \] where penalty function is the well-known adaptive LASSO (Zou 2006). \(y_i\)s are responses and \(x_i\) are predictors. \(\gamma_n\) is the tuning parameter controlling the degree of robustness and efficiency of the regression estimators. Block coordinate gradient descent algorithm (Tseng and Yun 2009) is used to efficiently solve the optimization problem. The tuning parameter \(\gamma_n\) and regularization parameter \(\lambda_{nj}\) are chosen adaptively by default, while they can be supplied by the user.

Quick Start

This is a basic example which shows you how to use this package. First, we generate data which contain influential points in the response. Let covariate \(x_i\) follows a multivariate normal distribution \(N(0, \Sigma)\) where \(\Sigma_{ij}=0.5^{|i-j|}\), and the error term \(\epsilon\) follows a mixture normal distribution \(0.8N(0,1)+0.2N(10,6^2)\). The response is generated by \(Y=X^T\beta+\epsilon\), where \(\beta=(1, 1.5, 2, 1, 0, 0, 0, 0)^T\).

set.seed(1)
library(MASS)
N <- 1000
p <- 8
rho <- 0.5
beta_true <- c(1, 1.5, 2, 1, 0, 0, 0, 0)
H <- abs(outer(1:p, 1:p, "-"))
V <- rho^H
X <- mvrnorm(N, rep(0, p), V)

# generate error term from a mixture normal distribution
components <- sample(1:2, prob=c(0.8, 0.2), size=N, replace=TRUE)
mus <- c(0, 10)
sds <- c(1, 6)
err <- rnorm(n=N,mean=mus[components],sd=sds[components])

Y <- X %*% beta_true + err

Regression diagnostics for simple linear regression provide some intuition about the data.

plot(lm(Y ~ X))

The Q-Q plot suggests the normal assumption is not valid. Thus, a robust variable selection procedure is needed. We apply robustlm function to select important variables:

library(robustlm)
robustlm1 <- robustlm(X, Y)
robustlm1
#> $beta
#> [1] 0.9411568 1.5839011 2.0716890 0.9489619 0.0000000 0.0000000 0.0000000
#> [8] 0.0000000
#> 
#> $alpha
#> [1] 0
#> 
#> $gamma
#> [1] 8.3
#> 
#> $weight
#> [1]   87.140346    7.033846    4.340160    3.343782    6.833033  703.863830
#> [7]  193.860493  858.412613 2183.876884
#> 
#> $loss
#> [1] 250.3821

The estimated regression coefficients \((0.94, 1.58, 2.07, 0.95, 0.00, 0.00, 0.00, 0.00)\) are close to the true values\((1, 1.5, 2, 1, 0, 0, 0, 0)\). There is no mistaken selection or discard.

robustlm package also provides generic coef function to extract estimated coefficients and predict function to make a prediction:

coef(robustlm1)
#> $alpha
#> [1] 0
#> 
#> $beta
#> [1] 0.9411568 1.5839011 2.0716890 0.9489619 0.0000000 0.0000000 0.0000000
#> [8] 0.0000000
Y_pred <- predict(robustlm1, X)
head(Y_pred)
#>            [,1]
#> [1,]  4.3571855
#> [2,]  0.2831909
#> [3,]  3.0404662
#> [4,] -1.1539956
#> [5,]  0.9718605
#> [6,] -0.5732342

Boston Housing Price Dataset

We apply this package to analysis the Boston Housing Price Dataset, which is available in MASS package. The data contain 14 variables and 506 observations. The response variable is medv (median value of owner-occupied homes in thousand dollars), and the rest are the predictors. Here the predictors are scaled to have zero mean and unit variance. The responses are centerized.

data(Boston, package = "MASS")
head(Boston)
#>      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
#> 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
#> 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
#> 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
#> 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
#> 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
#> 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
#>   medv
#> 1 24.0
#> 2 21.6
#> 3 34.7
#> 4 33.4
#> 5 36.2
#> 6 28.7
# scaling and centering
Boston[, -14] <- scale(Boston[, -14])
Boston[, 14] <- scale(Boston[, 14], scale = FALSE)

# diagnostic
set.seed(1)
x <- as.matrix(Boston[, -14])
y <- Boston[, 14]
lm_OLS <- lm(y ~ x - 1)
plot(lm_OLS)

The diagnostic plots suggest the residuals may not follow normal distribution, we use robustlm to carry out variable selection with robustness.

# robustlm
robustlm2 <- robustlm(x, y)
coef(robustlm2)
#> $alpha
#> [1] -0.8696961
#> 
#> $beta
#>  [1]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  4.4744083
#>  [7] -0.7759438 -0.6542423  0.0000000 -0.6474789 -1.2535192  1.3645150
#> [13] -1.6753801

In this example, robustlm selected seven of the predictors while discarding six of them. Take a look at the correlations, it appears robustlm tends to select variables which have higher correlations with the response.

cor(x, y)
#>               [,1]
#> crim    -0.3883046
#> zn       0.3604453
#> indus   -0.4837252
#> chas     0.1752602
#> nox     -0.4273208
#> rm       0.6953599
#> age     -0.3769546
#> dis      0.2499287
#> rad     -0.3816262
#> tax     -0.4685359
#> ptratio -0.5077867
#> black    0.3334608
#> lstat   -0.7376627

The following procedure roughly illustrates the robustness. Take the predictor rm (average number of rooms per dwelling) as an example. We draw a 2-dimensional scatter plot with the fitted regression line on it. As shown below, the fitted line is not heavily affected by the outliers.

References

Tseng, Paul, and Sangwoon Yun. 2009. “A Coordinate Gradient Descent Method for Nonsmooth Separable Minimization.” Mathematical Programming 117 (1): 387–423. https://doi.org/10.1007/s10107-007-0170-0.

Wang, Xueqin, Yunlu Jiang, Mian Huang, and Heping Zhang. 2013. “Robust Variable Selection with Exponential Squared Loss.” Journal of the American Statistical Association 108 (502): 632–43. https://doi.org/10.1080/01621459.2013.766613.

Zou, Hui. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101 (476): 1418–29. https://doi.org/10.1198/016214506000000735.