SVEMnet Vignette

Andrew T. Karl

December 21, 2024

Version

1.3.0

Summary

SVEMnet implements Self-Validated Ensemble Models (SVEM; Lemkus et al. 2021) and the SVEM whole model test (Karl 2024) using Elastic Net regression via the glmnet package (Friedman et al. 2010). This vignette provides an overview of the package's functionality and usage.
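To follow along with the examples below, install the package (assuming it is available from CRAN for your version of R) and load it:

# Install SVEMnet (only needed once), then load it
install.packages("SVEMnet")
library(SVEMnet)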

Preface - Note from the author

The motivation to create the SVEMnet package was primarily to have a personal sandbox to explore SVEM performance in different scenarios and with various modifications to its structure. I did not originally intend to publish it, but after having used it for a while I believe it could be useful to others.

As noted in the documentation, I used GPT o1-preview to help form the code structure of the package and to code the Roxygen structure of the documentation. The SVEM significance test R code comes from the supplementary material of Karl (2024). I wrote that code by hand and validated each step (not including the creation of the SVEM predictions) against corresponding results in JMP (the supplementary material of Karl (2024) provides the matching JSL script). For the SVEMnet() code, assuming only a single value of alpha is being passed to glmnet, the heart of the SVEM loop is simply:

# Partial code for illustration of the SVEM loop
coef_matrix <- matrix(NA, nrow = nBoot, ncol = p + 1)
for (i in 1:nBoot) {
  # Anti-correlated exponential weights for training and validation
  U <- runif(n)
  w_train <- -log(U)
  w_valid <- -log(1 - U)
  # Match glmnet's normalization of the weight vectors
  w_train <- w_train * (n / sum(w_train))
  w_valid <- w_valid * (n / sum(w_valid))
  fit <- glmnet(
    X, y_numeric,
    alpha = alpha,
    weights = w_train,
    intercept = TRUE,
    standardize = standardize,
    maxit = 1e6,
    nlambda = 500
  )
  # Predictions over the entire lambda path
  pred_valid <- predict(fit, newx = X)
  # Validation-weighted SSE and AIC for each lambda
  val_errors <- colSums(w_valid * (y_numeric - pred_valid)^2)
  k_values <- fit$df
  n_obs <- length(y_numeric)
  aic_values <- n_obs * log(val_errors / n_obs) + 2 * k_values
  # Choose lambda according to the selected objective
  if (objective == "wSSE") {
    idx_min <- which.min(val_errors)
    lambda_opt <- fit$lambda[idx_min]
    val_error <- val_errors[idx_min]
  } else if (objective == "wAIC") {
    idx_min <- which.min(aic_values)
    lambda_opt <- fit$lambda[idx_min]
    val_error <- aic_values[idx_min]
  }
  coef_matrix[i, ] <- as.vector(coef(fit, s = lambda_opt))
}
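After the loop, the nBoot coefficient vectors are aggregated to form the ensemble. A minimal sketch of the idea (assuming the coef_matrix and model matrix X from above; SVEMnet's internal aggregation adds more bookkeeping, but for a linear model, averaging the coefficient vectors gives the same result as averaging the predictions of the individual fits):

# Aggregate the bootstrap fits: average the coefficient vectors
coef_ensemble <- colMeans(coef_matrix)
# Ensemble predictions: intercept plus X times the averaged slopes
pred_ensemble <- as.vector(coef_ensemble[1] + X %*% coef_ensemble[-1])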

However, to get this to a stable implementation that includes error and warning handling, along with the structure needed for S3 methods such as predict(), coef(), and plot(), it was only practical for me to use help from GPT o1-preview. I simply would not have taken the time to add that structure otherwise, and my implementation would have been inferior. I reviewed all of the code generated by this tool before integrating it and corrected its occasional mistakes. If someone would like to create a purely human-written set of code for a similar purpose, let me know and I will be happy to add links to your package and a description to the SVEMnet documentation.

SVEMnet Example 1

library(SVEMnet)

# Example data
data <- iris
svem_model <- SVEMnet(Sepal.Length ~ ., data = data, nBoot = 300)
coef(svem_model)
##                   Percent of Bootstraps Nonzero
## Sepal.Width                           1.0000000
## Petal.Length                          1.0000000
## Speciesvirginica                      0.9333333
## Speciesversicolor                     0.9300000
## Petal.Width                           0.9166667

Generate a plot of actual versus predicted values:

plot(svem_model)

Predict outcomes for new data using the predict() function:

predictions <- predict(svem_model, data)
print(predictions)
##        1        2        3        4        5        6        7        8 
## 5.005617 4.740836 4.773051 4.867489 5.058573 5.383305 4.925047 5.026358 
##        9       10       11       12       13       14       15       16 
## 4.687880 4.895104 5.185226 5.100055 4.768450 4.547358 5.123003 5.500692 
##       17       18       19       20       21       22       23       24 
## 5.088516 4.978003 5.357963 5.210569 5.173753 5.129998 4.763784 5.037954 
##       25       26       27       28       29       30       31       32 
## 5.321147 4.888231 5.044827 5.079314 4.952661 4.994143 4.941187 4.971130 
##       33       34       35       36       37       38       39       40 
## 5.424665 5.376310 4.867489 4.699354 4.931919 5.086187 4.667139 5.026358 
##       41       42       43       44       45       46       47       48 
## 4.904305 4.268832 4.773051 5.042555 5.477744 4.713222 5.311880 4.846748 
##       49       50       51       52       53       54       55       56 
## 5.185226 4.899704 6.473589 6.298580 6.540413 5.508716 6.160453 6.141983 
##       57       58       59       60       61       62       63       64 
## 6.471317 5.128633 6.268637 5.619229 5.064203 5.971576 5.538602 6.314720 
##       65       66       67       68       69       70       71       72 
## 5.531663 6.199541 6.192668 5.877080 5.769018 5.596159 6.436830 5.773497 
##       73       74       75       76       77       78       79       80 
## 6.222676 6.316992 6.047545 6.146584 6.335461 6.505926 6.139712 5.381940 
##       81       82       83       84       85       86       87       88 
## 5.469505 5.423422 5.674457 6.448369 6.192668 6.376878 6.393019 5.803505 
##       89       90       91       92       93       94       95       96 
## 5.953106 5.614628 5.989988 6.293979 5.695198 5.075677 5.867935 6.054418 
##       97       98       99      100      101      102      103      104 
## 5.973848 6.047545 4.932883 5.847194 6.965055 6.149726 6.842945 6.651739 
##      105      106      107      108      109      110      111      112 
## 6.741634 7.358827 5.656858 7.167620 6.587309 7.197621 6.386893 6.297121 
##      113      114      115      116      117      118      119      120 
## 6.548156 5.942502 6.064612 6.451445 6.630998 7.828559 7.312866 5.921704 
##      121      122      123      124      125      126      127      128 
## 6.746235 6.027673 7.354226 6.029945 6.854419 7.105397 6.009204 6.188814 
##      129      130      131      132      133      134      135      136 
## 6.515941 6.907318 6.939655 7.662695 6.488327 6.313138 6.603326 6.935112 
##      137      138      139      140      141      142      143      144 
## 6.750836 6.683954 6.115116 6.527415 6.591967 6.251095 6.149726 6.893629 
##      145      146      147      148      149      150 
## 6.743963 6.271836 5.970116 6.354678 6.631055 6.336208

Whole Model Significance Testing

This is the serial version of the significance test. It is slower, but its code is easier to read than that of the faster parallel version.

test_result <- svem_significance_test(Sepal.Length ~ ., data = data)
print(test_result)
plot(test_result)
SVEM Significance Test p-value:
[1] 0
[Figure: Whole model test result]

Note that there is a parallelized version that runs much faster:

test_result <- svem_significance_test_parallel(Sepal.Length ~ ., data = data)
print(test_result)
plot(test_result)
SVEM Significance Test p-value:
[1] 0

SVEMnet Example 2

# Simulate data
set.seed(1)
n <- 25
X1 <- runif(n)
X2 <- runif(n)
X3 <- runif(n)
X4 <- runif(n)
X5 <- runif(n)

# y depends only on X1 and X2
y <- 1 + X1 +  X2 + X1 * X2 + X1^2 + rnorm(n)
data <- data.frame(y, X1, X2, X3, X4, X5)

# Perform the SVEM significance test
test_result <- svem_significance_test_parallel(
  y ~ (X1 + X2 + X3)^2 + I(X1^2) + I(X2^2) + I(X3^2),
  data = data
)

# View the p-value
print(test_result)
SVEM Significance Test p-value:
[1] 0.009399093


test_result2 <- svem_significance_test_parallel(
  y ~ (X1 + X2)^2 + I(X1^2) + I(X2^2),
  data = data
)

# View the p-value
print(test_result2)
SVEM Significance Test p-value:
[1] 0.006475736

# Note that the response does not depend on X4 or X5
test_result3 <- svem_significance_test_parallel(
  y ~ (X4 + X5)^2 + I(X4^2) + I(X5^2),
  data = data
)

# View the p-value
print(test_result3)
SVEM Significance Test p-value:
[1] 0.8968502

# Plot the Mahalanobis distances
plot(test_result,test_result2,test_result3)
[Figure: Whole model test results for Example 2]

Simulations to select SVEMnet settings

There are many particular scenarios we might focus on when optimizing SVEMnet settings: a certain number of factors with a certain number of interactions, and so on. However, when setting defaults for software, we want them to work well over the wide range of scenarios that might be encountered.

Our simulations target a response surface model (RSM) in p factors. For a selected density \(d\in[0,1]\), n_active <- max(1, floor(p * d)) of the main-effect coefficients are set to rexp(1) - rexp(1), while the remaining coefficients of the \(\frac{(p+1)(p+2)}{2}\)-parameter RSM are zero. There are n points in the Latin hypercube design. This is not an endorsement of the Latin hypercube method for designed experiments; it is merely a quick way to generate space-filling points for the simulation. It would also be possible to run the simulation with optimal designs or other space-filling approaches (such as Fast Flexible Filling, Jones and Lekivetz (2014)). However, for supersaturated settings (where \(n<\frac{(p+1)(p+2)}{2}\)), optimal designs would require additional work to specify, and that is not needed for this simulation.
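As a quick check on the parameter counts involved (a small helper written for this vignette, not part of SVEMnet):

# Parameters in a full quadratic RSM with p factors:
# intercept + p main effects + p quadratic terms + choose(p, 2) interactions
rsm_params <- function(p) 1 + 2 * p + choose(p, 2)  # equals (p + 1) * (p + 2) / 2
rsm_params(3:6)
## [1] 10 15 21 28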

The models are trained on n observations and compared to an independent test set with n_holdout observations.
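Throughout, holdout performance is summarized by the log root mean squared error, matching the computation in the scripts below:

\[ \mathrm{LRMSE} = \log\sqrt{\frac{1}{n_{\text{holdout}}}\sum_{i=1}^{n_{\text{holdout}}}\left(y_i^{\text{true}}-\hat{y}_i\right)^2}. \]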

# Define vectors for p, d, n, sd
p_values <- seq(3, 6, 1)        # Number of factors
d_values <- seq(0.1, 0.9, 0.1)  # Density (proportion of active main effects)
n_values <- seq(15, 50, 5)      # Number of design points
sd_values <- c(0.25, 0.5, 1, 1.5)  # Standard deviations of noise

nSim <- 20                  # Number of simulations per setting
n_holdout <- 1000               # Number of holdout points

# Create a grid of all combinations of p, d, n, sd
param_grid <- expand.grid(p = p_values, d = d_values, n = n_values, sd = sd_values)

Simulation 1

First we compare the log root mean squared error (LRMSE) on the holdout set for the four models corresponding to the combinations of objective={"wAIC","wSSE"} and debias={TRUE,FALSE}. Lemkus et al. (2021) use objective="wSSE" and debias=FALSE. JMP uses objective="wSSE" and debias=TRUE. Based on the simulations below, SVEMnet defaults to objective="wAIC" and debias=FALSE. Note that this is not a commentary on JMP's settings or a statement about globally optimal SVEM settings; these are simply the combinations that SVEMnet works best with over the tested scenarios.
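For reference, the two objectives, as computed in the SVEM loop shown in the preface, are

\[ \mathrm{wSSE}(\lambda)=\sum_{i=1}^{n} w_i^{\text{valid}}\left(y_i-\hat{y}_i(\lambda)\right)^2, \qquad \mathrm{wAIC}(\lambda)=n\log\left(\frac{\mathrm{wSSE}(\lambda)}{n}\right)+2k_\lambda, \]

where \(k_\lambda\) is the number of nonzero coefficients at \(\lambda\) (fit$df in glmnet).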

[Figure: LRMSE for {debias} x {objective}]

The next plot shows the residuals of LRMSE after removing, within each simulation, the mean LRMSE of the four models; this yields a paired comparison. Notice that the combination objective="wAIC" with debias=FALSE performs best in SVEMnet.
[Figure: Paired LRMSE for {debias} x {objective}]

The first plot above shows the test LRMSE for each of the four models. The script for this simulation is given below; note that it was generated with GPT o1-preview.

# Load necessary libraries
library(SVEMnet)
library(lhs)
library(ggplot2)
library(dplyr)
library(pbapply)
library(parallel)

# Define vectors for p, d, n, sd
p_values <- seq(3, 6, 1)        # Number of factors
d_values <- seq(0.1, 0.9, 0.1)  # Density (proportion of active main effects)
n_values <- seq(15, 50, 5)      # Number of design points
sd_values <- c(0.25, 0.5, 1, 1.5)  # Standard deviations of noise

nSim <- 20                  # Number of simulations per setting
n_holdout <- 1000               # Number of holdout points

# Create a grid of all combinations of p, d, n, sd
param_grid <- expand.grid(p = p_values, d = d_values, n = n_values, sd = sd_values)

# Prepare a list of simulation parameters
sim_params_list <- list()
sim_id <- 1  # Initialize simulation ID

for (i in 1:nrow(param_grid)) {
  p <- param_grid$p[i]
  d <- param_grid$d[i]
  n <- param_grid$n[i]
  sd <- param_grid$sd[i]

  for (sim in 1:nSim) {
    sim_params_list[[length(sim_params_list) + 1]] <- list(
      sim_id = sim_id,
      p = p,
      d = d,
      n = n,
      sd = sd
    )
    sim_id <- sim_id + 1
  }
}

total_iterations <- length(sim_params_list)
cat("Total simulations to run:", total_iterations, "\n")

# Set up parallel backend using 'parallel' package
num_cores <- 8
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)
cl <- makeCluster(num_cores)
clusterEvalQ(cl, {
  library(SVEMnet)
  library(lhs)
  library(dplyr)
})

# Export necessary variables to the cluster
clusterExport(cl, varlist = c("n_holdout"))

# Set up RNG for parallel processing

clusterSetRNGStream(cl, 1234)

# Function to perform one simulation for a given setting
simulate_one <- function(sim_params) {
  sim_id <- sim_params$sim_id
  p <- sim_params$p
  d <- sim_params$d
  n <- sim_params$n
  sd <- sim_params$sd

  # Generate design data X using LHS
  X <- randomLHS(n, p)
  colnames(X) <- paste0("V", 1:p)

  # Generate holdout data X_T using LHS
  X_T <- randomLHS(n_holdout, p)
  colnames(X_T) <- paste0("V", 1:p)

  # Select active parameters
  n_active <- max(1, floor(p * d))
  active_params <- sample(1:p, n_active)

  # Generate coefficients
  beta <- numeric(p)
  beta[active_params] <- rexp(n_active) - rexp(n_active)

  # Generate response variable y
  y <- X %*% beta + rnorm(n, mean = 0, sd = sd)
  y <- as.numeric(y)  # Convert to vector

  # Compute true y for holdout data X_T
  y_T_true <- X_T %*% beta
  y_T_true <- as.numeric(y_T_true)

  # Define formula for SVEMnet
  data_train <- data.frame(y = y, X)
  data_holdout <- data.frame(X_T)
  colnames(data_holdout) <- colnames(X)

  # Include interactions and quadratic terms
  formula <- as.formula(paste(
    "y ~ (", paste(colnames(X), collapse = " + "), ")^2 + ",
    paste("I(", colnames(X), "^2)", collapse = " + ")
  ))

  # Initialize data frame to store results for this simulation
  sim_results <- data.frame()

  # Loop over objectives and debias options. The debias option only affects
  # prediction, but each combination below receives its own independent SVEM fit.
  for (objective in c("wAIC", "wSSE")) {
    for (debias in c(TRUE, FALSE)) {
      # Fit SVEMnet model
      model <- SVEMnet(
        formula = formula,
        data = data_train,
        objective = objective,
        nBoot = 200,
        glmnet_alpha = c(1),  # Lasso
        weight_scheme = "SVEM"
      )

      # Predict on holdout data
      y_pred_T <- predict(
        object = model,
        newdata = data_holdout,
        debias = debias
      )

      # Compute RMSE over X_T
      RMSE <- sqrt(mean((y_T_true - y_pred_T)^2))
      logRMSE <- log(RMSE)

      # Record results
      sim_results <- rbind(
        sim_results,
        data.frame(
          sim_id = sim_id,
          p = p,
          d = d,
          n = n,
          sd = sd,
          objective = objective,
          debias = debias,
          logRMSE = logRMSE
        )
      )
    }
  }

  return(sim_results)
}

# Run simulations using pblapply with progress bar
results_list <- pblapply(sim_params_list, simulate_one, cl = cl)

# Stop the cluster
stopCluster(cl)

# Combine all results into a data frame
results <- bind_rows(results_list)

# Convert sim_id to integer (since it may come back as a factor)
results$sim_id <- as.integer(as.character(results$sim_id))

# Compute the mean logRMSE for each simulation ID
results <- results %>%
  group_by(sim_id) %>%
  mutate(sim_mean_logRMSE = mean(logRMSE)) %>%
  ungroup()

# Compute residuals
results <- results %>%
  mutate(residual_logRMSE = logRMSE - sim_mean_logRMSE)

# Create a new variable that concatenates 'objective' and 'debias'
results <- results %>%
  mutate(obj_debias = paste0(objective, "_", debias))

# Compute average residuals for each combination of 'obj_debias'
average_residuals <- results %>%
  group_by(obj_debias) %>%
  summarise(mean_residual_logRMSE = mean(residual_logRMSE),
            .groups = 'drop')

# Print average residuals
print(average_residuals)

# Plot residuals side by side for the four combinations
# Set factor levels for desired order
results$obj_debias <- factor(results$obj_debias,
                             levels = c("wAIC_TRUE", "wAIC_FALSE", "wSSE_TRUE", "wSSE_FALSE"))

ggplot(results, aes(x = obj_debias, y = logRMSE)) +
  geom_boxplot() +
  xlab("Objective_Debias Combination") +
  ylab("LogRMSE") +
  ggtitle("logRMSE over all simulations") +
  theme_minimal()


ggplot(results, aes(x = obj_debias, y = residual_logRMSE)) +
  geom_boxplot() +
  xlab("Objective_Debias Combination") +
  ylab("Residual LogRMSE") +
  ggtitle("Residuals (after removing simulation mean logRMSE)") +
  theme_minimal()

Simulation 2

The second simulation compares performance across the weight_scheme argument of SVEMnet. weight_scheme="Identity" corresponds to the single-shot (traditional) Lasso (when glmnet_alpha=1) fit on the training data; it is fit with nBoot=1. weight_scheme="FWR" corresponds to fractional weight regression (Xu et al. 2020): it uses the same exponential weights for the training data as weight_scheme="SVEM", but it reuses those same weights for validation rather than computing anti-correlated validation weights as SVEM does (Lemkus et al. 2021). SVEM and FWR are used with nBoot=200, and all models are fit with objective="wAIC" and debias=FALSE.
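A minimal sketch of how the three schemes weight a single bootstrap draw (illustrative only; SVEMnet also rescales the weights to sum to n, as in the preface code):

# One draw of observation weights for n observations under each scheme
n <- 10
U <- runif(n)
w_train_svem <- -log(U)       # SVEM: exponential training weights
w_valid_svem <- -log(1 - U)   # SVEM: anti-correlated validation weights
w_train_fwr  <- -log(U)       # FWR: same exponential training weights...
w_valid_fwr  <- w_train_fwr   # ...reused unchanged for validation
w_identity   <- rep(1, n)     # Identity: a single unweighted fit
cor(w_train_svem, w_valid_svem)  # negative by construction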

[Figure: LRMSE for different weight_scheme]

The next plot shows the residuals of LRMSE after removing, within each simulation, the mean LRMSE of the three models, again yielding a paired comparison. Notice that SVEM outperforms both the single-shot AIC lasso and fractional weight regression. It is somewhat surprising that the single-shot AIC lasso outperforms the FWR lasso, but this could have to do with the wide range of settings included in the simulation. For example, when p=6 the fitted RSM has 28 parameters while some of the simulations include as few as 15 runs, an extreme case of fitting a supersaturated design. Interested readers are encouraged to modify the simulation code to focus on scenarios of more personal interest, perhaps concentrating on less extreme situations.
[Figure: Residual LRMSE for different weight_scheme]

The first plot above shows the test LRMSE for each of the three models. The script for this simulation is given below; note that it was generated with GPT o1-preview.

# Load necessary libraries
library(SVEMnet)
library(lhs)
library(ggplot2)
library(dplyr)
library(pbapply)
library(parallel)
library(tidyr)  # For pivot_wider

# Define vectors for p, d, n, sd
p_values <- seq(3, 6, 1)        # Number of factors
d_values <- seq(0.1, 0.9, 0.1)  # Density (proportion of active main effects)
n_values <- seq(15, 50, 5)      # Number of design points
sd_values <- c(0.25, 0.5, 1, 1.5)  # Standard deviations of noise

nSim <- 20                  # Number of simulations per setting
n_holdout <- 1000           # Number of holdout points

# Create a grid of all combinations of p, d, n, sd
param_grid <- expand.grid(p = p_values, d = d_values, n = n_values, sd = sd_values)

# Prepare a list of simulation parameters
sim_params_list <- list()
sim_id <- 1  # Initialize simulation ID

for (i in 1:nrow(param_grid)) {
  p <- param_grid$p[i]
  d <- param_grid$d[i]
  n <- param_grid$n[i]
  sd <- param_grid$sd[i]

  for (sim in 1:nSim) {
    sim_params_list[[length(sim_params_list) + 1]] <- list(
      sim_id = sim_id,
      p = p,
      d = d,
      n = n,
      sd = sd
    )
    sim_id <- sim_id + 1
  }
}

total_iterations <- length(sim_params_list)
cat("Total simulations to run:", total_iterations, "\n")

# Set up parallel backend using 'parallel' package
num_cores <- 8
RNGkind("L'Ecuyer-CMRG")
set.seed(0)
cl <- makeCluster(num_cores)
clusterEvalQ(cl, {
  library(SVEMnet)
  library(lhs)
  library(dplyr)
})

# Export necessary variables to the cluster
clusterExport(cl, varlist = c("n_holdout"))

# Set up RNG for parallel processing

clusterSetRNGStream(cl, 0)

# Function to perform one simulation for a given setting
simulate_one <- function(sim_params) {
  sim_id <- sim_params$sim_id
  p <- sim_params$p
  d <- sim_params$d
  n <- sim_params$n
  sd <- sim_params$sd

  # Generate design data X using LHS
  X <- randomLHS(n, p)
  colnames(X) <- paste0("V", 1:p)

  # Generate holdout data X_T using LHS
  X_T <- randomLHS(n_holdout, p)
  colnames(X_T) <- paste0("V", 1:p)

  # Select active parameters
  n_active <- max(1, floor(p * d))
  active_params <- sample(1:p, n_active)

  # Generate coefficients
  beta <- numeric(p)
  beta[active_params] <- rexp(n_active) - rexp(n_active)

  # Generate response variable y
  y <- X %*% beta + rnorm(n, mean = 0, sd = sd)
  y <- as.numeric(y)  # Convert to vector

  # Compute true y for holdout data X_T
  y_T_true <- X_T %*% beta
  y_T_true <- as.numeric(y_T_true)

  # Define formula for SVEMnet
  data_train <- data.frame(y = y, X)
  data_holdout <- data.frame(X_T)
  colnames(data_holdout) <- colnames(X)

  # Include interactions and quadratic terms
  formula <- as.formula(paste(
    "y ~ (", paste(colnames(X), collapse = " + "), ")^2 + ",
    paste("I(", colnames(X), "^2)", collapse = " + ")
  ))

  # Initialize data frame to store results for this simulation
  sim_results <- data.frame()

  # Set debias = FALSE and objective = "wAIC"
  debias <- FALSE
  objective <- "wAIC"

  # Loop over model types
  for (model_type in c("SVEM", "FWR", "Identity")) {
    # Set weight_scheme and nBoot
    if (model_type == "SVEM") {
      weight_scheme <- "SVEM"
      nBoot <- 200
    } else if (model_type == "FWR") {
      weight_scheme <- "FWR"
      nBoot <- 200
    } else if (model_type == "Identity") {
      weight_scheme <- "Identity"
      nBoot <- 1
    }

    # Fit SVEMnet model
    model <- SVEMnet(
      formula = formula,
      data = data_train,
      objective = objective,
      nBoot = nBoot,
      glmnet_alpha = c(1),  # Lasso
      weight_scheme = weight_scheme
    )

    # Predict on holdout data
    y_pred_T <- predict(
      object = model,
      newdata = data_holdout,
      debias = debias
    )

    # Compute RMSE over X_T
    RMSE <- sqrt(mean((y_T_true - y_pred_T)^2))
    logRMSE <- log(RMSE)

    # Record results
    sim_results <- rbind(
      sim_results,
      data.frame(
        sim_id = sim_id,
        p = p,
        d = d,
        n = n,
        sd = sd,
        model_type = model_type,
        logRMSE = logRMSE
      )
    )
  }

  return(sim_results)
}

# Run simulations using pblapply with progress bar
results_list <- pblapply(sim_params_list, simulate_one, cl = cl)

# Stop the cluster
stopCluster(cl)

# Combine all results into a data frame
results <- bind_rows(results_list)

# Convert sim_id to integer
results$sim_id <- as.integer(as.character(results$sim_id))

# Compute the mean logRMSE for each simulation ID
results <- results %>%
  group_by(sim_id) %>%
  mutate(sim_mean_logRMSE = mean(logRMSE)) %>%
  ungroup()

# Compute residuals
results <- results %>%
  mutate(residual_logRMSE = logRMSE - sim_mean_logRMSE)

# Compute average residuals for each model_type
average_residuals <- results %>%
  group_by(model_type) %>%
  summarise(mean_residual_logRMSE = mean(residual_logRMSE),
            .groups = 'drop')

# Print average residuals
print(average_residuals)

# Plot residuals side by side for the three models
results$model_type <- factor(results$model_type, levels = c("SVEM", "FWR", "Identity"))

ggplot(results, aes(x = model_type, y = logRMSE)) +
  geom_boxplot() +
  xlab("Model Type") +
  ylab("LogRMSE") +
  ggtitle("LogRMSE by Model") +
  theme_minimal()

ggplot(results, aes(x = model_type, y = residual_logRMSE)) +
  geom_boxplot() +
  xlab("Model Type") +
  ylab("Residual LogRMSE") +
  ggtitle("Residuals (after removing simulation mean logRMSE)") +
  theme_minimal()

# Compute paired differences between models for each simulation
results_wide <- results %>%
  select(sim_id, model_type, logRMSE) %>%
  pivot_wider(names_from = model_type, values_from = logRMSE)

# Compute differences
results_wide <- results_wide %>%
  mutate(
    diff_SVEM_FWR = SVEM - FWR,
    diff_SVEM_Identity = SVEM - Identity,
    diff_FWR_Identity = FWR - Identity
  )

# Summarize differences
differences_summary <- results_wide %>%
  summarise(
    mean_diff_SVEM_FWR = mean(diff_SVEM_FWR, na.rm = TRUE),
    mean_diff_SVEM_Identity = mean(diff_SVEM_Identity, na.rm = TRUE),
    mean_diff_FWR_Identity = mean(diff_FWR_Identity, na.rm = TRUE),
    sd_diff_SVEM_FWR = sd(diff_SVEM_FWR, na.rm = TRUE),
    sd_diff_SVEM_Identity = sd(diff_SVEM_Identity, na.rm = TRUE),
    sd_diff_FWR_Identity = sd(diff_FWR_Identity, na.rm = TRUE)
  )

print(differences_summary)

# Optionally, perform paired t-tests
t_test_SVEM_FWR <- t.test(results_wide$SVEM, results_wide$FWR, paired = TRUE)
t_test_SVEM_Identity <- t.test(results_wide$SVEM, results_wide$Identity, paired = TRUE)
t_test_FWR_Identity <- t.test(results_wide$FWR, results_wide$Identity, paired = TRUE)

# Print t-test results
print(t_test_SVEM_FWR)
print(t_test_SVEM_Identity)
print(t_test_FWR_Identity)

21DEC2024: Add glmnet.cv wrapper

This example shows the newly added wrapper for cv.glmnet(), used to compare the performance of SVEM with glmnet's native cross-validation implementation. In this example, the factors are mixture factors (generated with rdirichlet()).

In this simulation, both SVEM (with objective="wAIC") and cv.glmnet outperform the single-shot elastic net using AIC, and cv.glmnet slightly outperforms SVEM. This raises the question of when it is better to use SVEMnet(..., objective="wAIC") and when it is better to use cv.glmnet. The results again suggest that debiasing harms predictive performance on holdout data.
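A minimal usage sketch of the wrapper (using the same call signatures as the simulation script below), reusing the iris data from Example 1:

# Fit glmnet with native cross-validation over several alpha values, then predict
cv_mod <- glmnet_with_cv(Sepal.Length ~ ., data = iris, glmnet_alpha = c(0, 0.5, 1))
pred_cv <- predict_cv(cv_mod, newdata = iris, debias = FALSE)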

The plot shows the residual test LRMSE for each model, after subtracting the mean test LRMSE of the models within each simulation. The script for this simulation is given below; note that it was generated with GPT o1-preview.
[Figure: Residual LRMSE for different modeling approaches]

# Install if not already installed
# install.packages("gtools")

library(SVEMnet)
library(ggplot2)
library(dplyr)
library(parallel)
library(tidyr)
library(gtools)  # for rdirichlet

# Define vectors for p, d, n, sd, and outlier_prop
p_values <- seq(4, 7, 1)            # Number of mixture components
d_values <- seq(0.1, 0.9, 0.1)      # Density (proportion of active terms)
n_values <- seq(20, 70, 10)         # Number of design points
sd_values <- c(0.25, 0.5, 1)        # Standard deviations of noise
outlier_prop_values <- c(0, 0.1)    # Proportions of outliers

nSim <- 5           # Number of simulations per setting
n_holdout <- 2000         # Number of holdout points
mult <- 200               # Oversampling multiple for candidate set generation

# Create a grid of all combinations of p, d, n, sd, outlier_prop
param_grid <- expand.grid(
  p = p_values,
  d = d_values,
  n = n_values,
  sd = sd_values,
  outlier_prop = outlier_prop_values
)

# Prepare a list of simulation parameters
sim_params_list <- list()
sim_id <- 1

for (i in 1:nrow(param_grid)) {
  p <- param_grid$p[i]
  d <- param_grid$d[i]
  n <- param_grid$n[i]
  sd <- param_grid$sd[i]
  outlier_prop <- param_grid$outlier_prop[i]

  for (sim in 1:nSim) {
    sim_params_list[[length(sim_params_list) + 1]] <- list(
      sim_id = sim_id,
      p = p,
      d = d,
      n = n,
      sd = sd,
      outlier_prop = outlier_prop
    )
    sim_id <- sim_id + 1
  }
}

total_iterations <- length(sim_params_list)
cat("Total simulations to run:", total_iterations, "\n")

# Set up parallel backend
num_cores <- 10
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
cl <- makeCluster(num_cores)
clusterSetRNGStream(cl, 1)
clusterEvalQ(cl, {
  library(SVEMnet)
  library(dplyr)
  library(gtools)
})

clusterExport(cl, varlist = c("n_holdout","mult"))

simulate_one <- function(sim_params) {
  sim_id <- sim_params$sim_id
  p <- sim_params$p
  d <- sim_params$d
  n <- sim_params$n
  sd <- sim_params$sd
  outlier_prop <- sim_params$outlier_prop

  # Step 1: Generate a large candidate set of points in the simplex
  N <- n * mult
  candidates <- rdirichlet(N, rep(1, p))

  # Step 2: k-means to get n cluster centroids as final design
  km_result <- kmeans(candidates, centers = n, nstart = 10)

  # Final mixture design
  X <- km_result$centers
  colnames(X) <- paste0("V", 1:p)

  # Use a third-order model: (X1 + X2 + ... + Xp)^3
  formula_str <- paste("y ~ (", paste(colnames(X), collapse = " + "), ")^3")
  formula <- as.formula(formula_str)

  # Model formula for fitting (same third-order expansion)
  formula_str_model <- formula_str
  formula_model <- as.formula(formula_str_model)

  # Temporary data frame with y=0 just to build model matrix
  data_train <- data.frame(y = 0, X)
  mf <- model.frame(formula, data_train)
  X_full <- model.matrix(formula, mf)

  # Count how many terms (excluding intercept)
  M <- ncol(X_full) - 1
  n_active <- max(1, floor(M * d))

  # Select active terms from all non-intercept terms
  active_terms <- sample(2:(M+1), n_active, replace = FALSE)

  # Create coefficients
  beta <- numeric(M+1)
  beta[active_terms] <- rexp(n_active) - rexp(n_active)

  # Generate response y
  data_train$y <- drop(X_full %*% beta + rnorm(n, mean = 0, sd = sd))

  # Introduce outliers
  num_outliers <- ceiling(outlier_prop * n)
  if (num_outliers > 0) {
    outlier_indices <- sample(seq_len(n), num_outliers, replace = FALSE)
    data_train$y[outlier_indices] <- data_train$y[outlier_indices] + rnorm(num_outliers, mean = 0, sd = 3 * sd)
  }

  # Holdout data
  X_T <- rdirichlet(n_holdout, rep(1, p))
  colnames(X_T) <- paste0("V", 1:p)
  data_holdout <- data.frame(X_T)

  # Build the holdout model matrix (drop the response from the terms object)
  terms_obj <- terms(formula, data = data_train)
  terms_obj_no_resp <- delete.response(terms_obj)
  mf_T <- model.frame(terms_obj_no_resp, data_holdout)
  X_T_full <- model.matrix(terms_obj_no_resp, mf_T)
  y_T_true <- drop(X_T_full %*% beta)

  # Fit models
  model_svem <- SVEMnet(
    formula = formula_model,
    data = data_train,
    standardize = TRUE,
    objective = "wAIC",
    nBoot = 200,
    glmnet_alpha = c(0,0.5,1),
    weight_scheme = "SVEM"
  )

  model_glmnet_cv <- glmnet_with_cv(
    formula = formula_model,
    data = data_train,
    glmnet_alpha = c(0,0.5,1),
    standardize = TRUE
  )

  model_svem_identity <- SVEMnet(
    formula = formula_model,
    data = data_train,
    objective = "wAIC",
    standardize=TRUE,
    nBoot = 1,
    glmnet_alpha = c(0,0.5,1),
    weight_scheme = "Identity"
  )

  pred_list <- list()

  # SVEMnet debias=FALSE
  y_pred_svem_false <- predict(model_svem, newdata = data_holdout, debias = FALSE)
  RMSE_svem_false <- sqrt(mean((y_T_true - y_pred_svem_false)^2))
  logRMSE_svem_false <- log(RMSE_svem_false)
  pred_list[[length(pred_list)+1]] <- data.frame(
    sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
    model = "SVEMnet", debias = FALSE, logRMSE = logRMSE_svem_false
  )

  # SVEMnet debias=TRUE
  y_pred_svem_true <- predict(model_svem, newdata = data_holdout, debias = TRUE)
  RMSE_svem_true <- sqrt(mean((y_T_true - y_pred_svem_true)^2))
  logRMSE_svem_true <- log(RMSE_svem_true)
  pred_list[[length(pred_list)+1]] <- data.frame(
    sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
    model = "SVEMnet", debias = TRUE, logRMSE = logRMSE_svem_true
  )

  # glmnet_cv debias=FALSE
  y_pred_glmnet_false <- predict_cv(model_glmnet_cv, newdata = data_holdout, debias = FALSE)
  RMSE_glmnet_false <- sqrt(mean((y_T_true - y_pred_glmnet_false)^2))
  logRMSE_glmnet_false <- log(RMSE_glmnet_false)
  pred_list[[length(pred_list)+1]] <- data.frame(
    sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
    model = "glmnet_cv", debias = FALSE, logRMSE = logRMSE_glmnet_false
  )

  # glmnet_cv debias=TRUE
  y_pred_glmnet_true <- predict_cv(model_glmnet_cv, newdata = data_holdout, debias = TRUE)
  RMSE_glmnet_true <- sqrt(mean((y_T_true - y_pred_glmnet_true)^2))
  logRMSE_glmnet_true <- log(RMSE_glmnet_true)
  pred_list[[length(pred_list)+1]] <- data.frame(
    sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
    model = "glmnet_cv", debias = TRUE, logRMSE = logRMSE_glmnet_true
  )

  # SVEMnet_Identity debias=FALSE
  y_pred_svem_identity_false <- predict(model_svem_identity, newdata = data_holdout, debias = FALSE)
  RMSE_svem_identity_false <- sqrt(mean((y_T_true - y_pred_svem_identity_false)^2))
  logRMSE_svem_identity_false <- log(RMSE_svem_identity_false)
  pred_list[[length(pred_list)+1]] <- data.frame(
    sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
    model = "SVEMnet_Identity", debias = FALSE, logRMSE = logRMSE_svem_identity_false
  )

  # SVEMnet_Identity debias=TRUE
  y_pred_svem_identity_true <- predict(model_svem_identity, newdata = data_holdout, debias = TRUE)
  RMSE_svem_identity_true <- sqrt(mean((y_T_true - y_pred_svem_identity_true)^2))
  logRMSE_svem_identity_true <- log(RMSE_svem_identity_true)
  pred_list[[length(pred_list)+1]] <- data.frame(
    sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
    model = "SVEMnet_Identity", debias = TRUE, logRMSE = logRMSE_svem_identity_true
  )

  sim_results <- bind_rows(pred_list)
  return(sim_results)
}

# Run load-balanced parallel computations
results_list <- parLapplyLB(cl, sim_params_list, simulate_one)
stopCluster(cl)

results <- bind_rows(results_list)

results <- results %>%
  group_by(sim_id) %>%
  mutate(sim_mean_logRMSE = mean(logRMSE)) %>%
  ungroup() %>%
  mutate(residual_logRMSE = logRMSE - sim_mean_logRMSE)

average_residuals <- results %>%
  group_by(model, debias) %>%
  summarise(mean_residual_logRMSE = mean(residual_logRMSE), .groups = 'drop')

print(average_residuals)

results$model_debias <- paste0(results$model, "_debias=", results$debias)
results$model_debias <- factor(results$model_debias,
                               levels = c("SVEMnet_debias=FALSE", "SVEMnet_debias=TRUE",
                                          "glmnet_cv_debias=FALSE", "glmnet_cv_debias=TRUE",
                                          "SVEMnet_Identity_debias=FALSE", "SVEMnet_Identity_debias=TRUE"))

# Plot LogRMSE by model and debias setting
ggplot(results, aes(y = model_debias, x = logRMSE, fill = as.factor(debias))) +
  geom_boxplot() +
  ylab("Model & Debias Setting") +
  xlab("LogRMSE") +
  ggtitle("LogRMSE by Model and Debias Setting") +
  theme_minimal() +
  coord_flip() +
  scale_fill_discrete(name = "Debias", labels = c("FALSE", "TRUE"))

ggplot(results, aes(y = model_debias, x = residual_logRMSE, fill = as.factor(debias))) +
  geom_boxplot() +
  ylab("Model & Debias Setting") +
  xlab("Residual LogRMSE") +
  ggtitle("Residuals (after removing simulation mean)") +
  theme_minimal() +
  coord_flip() +
  scale_fill_discrete(name = "Debias", labels = c("FALSE", "TRUE"))

results_wide <- results %>%
  dplyr::select(sim_id, model_debias, logRMSE) %>%
  pivot_wider(names_from = model_debias, values_from = logRMSE)

results_wide <- results_wide %>%
  mutate(
    diff_SVEM_wAIC_debiasTRUE_FALSE = `SVEMnet_debias=TRUE` - `SVEMnet_debias=FALSE`,
    diff_glmnet_cv_debiasTRUE_FALSE = `glmnet_cv_debias=TRUE` - `glmnet_cv_debias=FALSE`,
    diff_SVEM_glmnet_cv_FALSE = `SVEMnet_debias=FALSE` - `glmnet_cv_debias=FALSE`,
    diff_SVEM_glmnet_cv_TRUE  = `SVEMnet_debias=TRUE` - `glmnet_cv_debias=TRUE`
  )

differences_summary <- results_wide %>%
  summarise(
    mean_diff_SVEM_debias = mean(diff_SVEM_wAIC_debiasTRUE_FALSE, na.rm = TRUE),
    sd_diff_SVEM_debias = sd(diff_SVEM_wAIC_debiasTRUE_FALSE, na.rm = TRUE),
    mean_diff_glmnet_debias = mean(diff_glmnet_cv_debiasTRUE_FALSE, na.rm = TRUE),
    sd_diff_glmnet_debias = sd(diff_glmnet_cv_debiasTRUE_FALSE, na.rm = TRUE),
    mean_diff_SVEM_glmnet_F = mean(diff_SVEM_glmnet_cv_FALSE, na.rm = TRUE),
    sd_diff_SVEM_glmnet_F = sd(diff_SVEM_glmnet_cv_FALSE, na.rm = TRUE),
    mean_diff_SVEM_glmnet_T = mean(diff_SVEM_glmnet_cv_TRUE, na.rm = TRUE),
    sd_diff_SVEM_glmnet_T = sd(diff_SVEM_glmnet_cv_TRUE, na.rm = TRUE)
  )

print(differences_summary)

t_test_SVEM_debias <- t.test(results_wide$`SVEMnet_debias=TRUE`, results_wide$`SVEMnet_debias=FALSE`, paired = TRUE)
t_test_glmnet_debias <- t.test(results_wide$`glmnet_cv_debias=TRUE`, results_wide$`glmnet_cv_debias=FALSE`, paired = TRUE)
t_test_SVEM_vs_glmnet_FALSE <- t.test(results_wide$`SVEMnet_debias=FALSE`, results_wide$`glmnet_cv_debias=FALSE`, paired = TRUE)
t_test_SVEM_vs_glmnet_TRUE  <- t.test(results_wide$`SVEMnet_debias=TRUE`, results_wide$`glmnet_cv_debias=TRUE`, paired = TRUE)

print(t_test_SVEM_debias)
print(t_test_glmnet_debias)
print(t_test_SVEM_vs_glmnet_FALSE)
print(t_test_SVEM_vs_glmnet_TRUE)

References and Citations

  1. Lemkus, T., Gotwalt, C., Ramsey, P., & Weese, M. L. (2021). Self-Validated Ensemble Models for Elastic Net Regression.
    Chemometrics and Intelligent Laboratory Systems, 219, 104439.
    DOI: 10.1016/j.chemolab.2021.104439

  2. Karl, A. T. (2024). A Randomized Permutation Whole-Model Test for SVEM.
    Chemometrics and Intelligent Laboratory Systems, 249, 105122.
    DOI: 10.1016/j.chemolab.2024.105122

  3. Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent.
    Journal of Statistical Software, 33(1), 1–22.
    DOI: 10.18637/jss.v033.i01

  4. Gotwalt, C., & Ramsey, P. (2018). Model Validation Strategies for Designed Experiments Using Bootstrapping Techniques With Applications to Biopharmaceuticals.
    JMP Discovery Conference.
    Link

  5. Ramsey, P., Gaudard, M., & Levin, W. (2021). Accelerating Innovation with Space-Filling Mixture Designs, Neural Networks, and SVEM.
    JMP Discovery Conference.
    Link

  6. Ramsey, P., & Gotwalt, C. (2018). Model Validation Strategies for Designed Experiments Using Bootstrapping Techniques With Applications to Biopharmaceuticals.
    JMP Discovery Summit Europe.
    Link

  7. Ramsey, P., Levin, W., Lemkus, T., & Gotwalt, C. (2021). SVEM: A Paradigm Shift in Design and Analysis of Experiments.
    JMP Discovery Summit Europe.
    Link

  8. Ramsey, P., & McNeill, P. (2023). CMC, SVEM, Neural Networks, DOE, and Complexity: It’s All About Prediction.
    JMP Discovery Conference.

  9. Karl, A., Wisnowski, J., & Rushing, H. (2022). JMP Pro 17 Remedies for Practical Struggles with Mixture Experiments.
    JMP Discovery Conference.
    Link

  10. Xu, L., Gotwalt, C., Hong, Y., King, C. B., & Meeker, W. Q. (2020). Applications of the Fractional-Random-Weight Bootstrap.
    The American Statistician, 74(4), 345–358.
    Link

  11. Karl, A. T. (2024). SVEMnet: Self-Validated Ensemble Models with Elastic Net Regression.
    R package version 1.1.1.

  12. JMP Help Documentation: Overview of Self-Validated Ensemble Models.
    Link