---
title: "Advanced functionality"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced functionality}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(SimEngine)
```

## Complex simulation levels

Often, simulation levels will be simple, such as a vector of sample sizes:

```{r}
sim <- new_sim()
sim %<>% set_levels(n = c(200,400,800))
```

However, there are many instances in which more complex objects are needed. For these cases, instead of a vector of numbers or character strings, a named list of lists can be used. The toy example below illustrates this.

```{r}
sim <- new_sim()
sim %<>% set_levels(
  n = c(10,100),
  distribution = list(
    "Beta 1" = list(type="Beta", params=c(0.3, 0.7)),
    "Beta 2" = list(type="Beta", params=c(1.5, 0.4)),
    "Normal" = list(type="Normal", params=c(3.0, 0.2))
  )
)
create_data <- function(n, type, params) {
  if (type=="Beta") {
    return(rbeta(n, shape1=params[1], shape2=params[2]))
  } else if (type=="Normal") {
    return(rnorm(n, mean=params[1], sd=params[2]))
  }
}
sim %<>% set_script(function() {
  x <- create_data(L$n, L$distribution$type, L$distribution$params)
  return(list("y"=mean(x)))
})
sim %<>% run()
```

Note that the list names (`"Beta 1"`, `"Beta 2"`, and `"Normal"`) become the entries in the `sim$results` dataframe, as well as the dataframe returned by `summarize()`.

```{r}
sim %>% summarize(list(stat="mean", x="y"))
```

Sometimes, it may be the case that you want to run simulations only for a subset of level combinations. In these cases, use the `.keep` option of `set_levels()` . First, set the levels normally. Second, view the `sim$levels_grid` dataframe to examine the level combinations and the associated `level_id` values. Third, call `set_levels()` again with the `.keep` option to specify which level combinations to keep. This is demonstrated below:

```{r}
sim <- new_sim()
sim %<>% set_levels(alpha=c(2,3,4), beta=c(50,60))
print(sim$levels_grid)
sim %<>% set_levels(.keep=c(1,2,6))
print(sim$levels_grid)
```

## Complex return data

In most situations, the results of simulations are numeric. However, we may want to return more complex data, such as matrices, lists, or model objects. To do this, we add a key-value pair to the list returned by the simulation script with the special key `".complex"` and a list (containing the complex data) as the value. This is illustrated in the toy example below. Here, the simulation script estimates the parameters of a linear regression and returns these as numeric, but also returns the estimated covariance matrix and the entire model object.

```{r}
sim <- new_sim()
sim %<>% set_levels(n=c(10, 100, 1000))
create_data <- function(n) {
  x <- runif(n)
  y <- 3 + 2*x + rnorm(n)
  return(data.frame("x"=x, "y"=y))
}
sim %<>% set_config(num_sim=2)
sim %<>% set_script(function() {
  dat <- create_data(L$n)
  model <- lm(y~x, data=dat)
  return(list(
    "beta0_hat" = model$coefficients[[1]],
    "beta1_hat" = model$coefficients[[2]],
    ".complex" = list(
      "model" = model,
      "cov_mtx" = vcov(model)
    )
  ))
})
sim %<>% run()
```

After running this simulation, the numeric results can be accessed directly via `sim$results` or using the `summarize()` function, as usual:

```{r}
head(sim$results)
```

To examine the complex return data, we can use the function `get_complex()`, as illustrated below:

```{r}
c5 <- get_complex(sim, sim_uid=5)
print(summary(c5$model))
print(c5$cov_mtx)
```

## Using the `batch` function

The `batch()` function is useful for sharing data or objects between simulation replicates. Essentially, it allows simulation replicates to be divided into "batches"; all replicates in a given batch will then share a certain set of objects. A common use case for this is a simulation that involves using multiple methods to analyze a shared dataset, and repeating this process over a number of dataset replicates. This may be of interest if, for example, it is computationally expensive to generate a simulated dataset.

To illustrate the use of `batch()` using this example, we first consider the following simulation:

```{r}
sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3)
sim %<>% set_script(function() {
  dat <- create_data(n=100)
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()
```

From the `"dat_1"` column of the results object (equal to the first element of the `dat` vector created in the simulation script), we see that a unique dataset was created for each simulation replicate:

```{r}
sim$results[order(sim$results$rep_id),c(1:7)!=5]
```

Suppose that instead, we wish to analyze each simulated dataset using multiple methods (in this case corresponding to `"est_mean"` and `"est_median"`), and repeat this procedure a total of three times. We can do this using the `batch()` function, as follows:

```{r}
sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3, batch_levels=NULL)
sim %<>% set_script(function() {
  batch({
    dat <- create_data(n=100)
  })
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()
```

In the code above, we changed two things. First, we added `batch_levels=NULL` to the `set_config()` call; this will be explained below. Second, we wrapped the code line `dat <- create_data(n=100)` inside the `batch()` function. Whatever code goes inside the `batch()` function will produce the same output for all simulations in a batch.

```{r}
sim$results[order(sim$results$rep_id),c(1:7)!=5]
```

In this case, from the `"dat_1"` column of the results object, we see that one dataset was created and shared by the batch corresponding to `sim_uids` 1 and 2 (likewise for `sim_uids` {3,5} and {4,6}).

However, the situation is often more complicated. Suppose we have a simulation with multiple levels, some that correspond to creating data and some that correspond to analyzing the data. Here, the `batch_levels` argument of `set_config()` plays a role. Specifically, this argument should be a character vector equal to the names of the simulation levels that are referenced (via the special variable `L`) from within a `batch()` block. In the example below, the levels `n` and `mu` are used within the `batch()` call, while the level `est` is not.

```{r}
sim <- new_sim()
create_data <- function(n, mu) { rnorm(n=n, mean=mu) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(n=c(10,100), mu=c(3,5), est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=2, batch_levels=c("n", "mu"), return_batch_id=T)
sim %<>% set_script(function() {
  batch({
    dat <- create_data(n=L$n, mu=L$mu)
  })
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()

sim$results[order(sim$results$batch_id),c(1:10)!=8]
```

The batches were created such that each batch contained two replicates, one for each level value of `est`. For expository purposes, we also specified the `return_batch_id=T` option in `set_config()` so that the results object would return the `batch_id`. This is not necessary in practice. The `batch_id` variable defines the batches; all simulations that share the same `batch_id` are in a single batch. The `return_batch_id=T` option can be useful to ensure correct usage of the `batch()` function.

We note the following about the `batch()` function:

* The code within the `batch()` code block must *only* create objects; this code should not change or delete existing objects, as these changes will be ignored.
* In the majority of cases, the `batch()` function will be called just once, at the beginning of the simulation script. However, it can be used anywhere in the script and can be called multiple times. The `batch()` function should never be used outside of the simulation script.
* Although we have illustrated the use of the `batch()` function to create a dataset to share between multiple simulation replicates, it can be used for much more, such as taking a sample from an existing dataset or computing shared nuisance function estimators.
* If the simulation is being run in parallel (either locally or on a CCS), `n_cores` cannot exceed the number of batches, since all simulations within a batch must run on the same core.
* If the simulation script uses the `batch()` function, the simulation cannot be updated using the `update_sim()` or `update_sim_on_cluster()` functions, with the exception of updates that only entail removing simulation replicates.

## Random number generator seeds

In statistical research, it is desirable to be able to reproduce the exact results of a simulation study. Since R code often involves stochastic (random) functions like `rnorm()` or `sample()` that return different values when called multiple times, reproducibility is not guaranteed. In a simple R script, calling the `set.seed()` function at the beginning of the script ensures that the stochastic functions that follow will produce the same results whenever the script is run. However, a more nuanced strategy is needed when running simulations. When running 100 replicates of the same simulation, we do not want each replicate to return identical results; rather, we would like for each replicate to be different from one another, but for *the entire set of replicates* to be the same when the entire simulation is run twice in a row. **SimEngine** manages this process, even when simulations are being run in parallel locally or on a cluster computing system. In **SimEngine**, a single "global seed" is used to generate a different seed for each simulation replicate. The `set_config()` function is used to set or change this global seed:

```{r}
sim %<>% set_config(seed=123)
```

If a seed is not set using `set_config()`, **SimEngine** will set a random seed automatically so that the results can be replicated if desired. To view this seed, we use the `vars()` function:

```{r}
sim <- new_sim()
print(vars(sim, "seed"))
```

## Debugging and error/warning handling

In the simulation coding workflow, errors are inevitable. Some errors may affect all simulation replicates, while other errors may only affect a subset of replicates. By default, when a simulation is run, **SimEngine** will not stop if an error occurs; instead, errors are logged and stored in a dataframe along with information about the simulation replicates that resulted in those errors. Examining this dataframe by typing `print(sim$errors)` can sometimes help to quickly pinpoint the issue. This is demonstrated below:

```{r}
sim <- new_sim()
sim %<>% set_config(num_sim=2)
sim %<>% set_levels(
  Sigma = list(
    s1 = list(mtx=matrix(c(3,1,1,2), nrow=2)),
    s3 = list(mtx=matrix(c(4,3,3,9), nrow=2)),
    s2 = list(mtx=matrix(c(1,2,2,1), nrow=2)),
    s4 = list(mtx=matrix(c(8,2,2,6), nrow=2))
  )
)
sim %<>% set_script(function() {
  x <- MASS::mvrnorm(n=1, mu=c(0,0), Sigma=L$Sigma$mtx)
  return(list(x1=x[1], x2=x[2]))
})
sim %<>% run()
print(sim$errors)
```

From the output above, we see that the code fails for the simulation replicates that use the level with `Sigma="s2"` because it uses an invalid (not positive definite) covariance matrix. Similarly, if a simulation involves replicates that throw warnings, all warnings are logged and stored in the dataframe `sim$warnings`.

The workflow demonstrated above can be useful to pinpoint errors, but it has two main drawbacks. First, it is undesirable to run a time-consuming simulation involving hundreds or thousands of replicates, only to find at the end that every replicate failed because of a typo. It may therefore useful to stop an entire simulation after a single error has occurred. Second, it can sometimes be difficult to determine exactly what caused an error without making use of more advanced debugging tools. For both of these situations, **SimEngine** includes the following configuration option:

```{r}
sim %<>% set_config(stop_at_error=TRUE)
```

Setting `stop_at_error=TRUE` will stop the simulation when it encounters any error. Furthermore, the error will be thrown by R in the usual way, so if the simulation is being run in in RStudio, the built-in debugging tools (such as "Show Traceback" and "Rerun with debug") can be used to find and fix the bug. Placing a call to `browser()` at the top of the simulation script can also be useful for debugging.

## Monte Carlo error

Statistical simulations are often based on the principle of Monte Carlo approximation; specifically, pseudo-random sampling is used to evaluate the performance of a statistical procedure under a particular data-generating process. The performance of the procedure can be viewed as a statistical parameter and, due to the fact that only a finite number of simulation replicates can be performed, there is uncertainty in any estimate of performance. This uncertainty is often referred to as *Monte Carlo error*. We can quantify Monte Carlo error using, for example, the standard error of the performance estimator.

Measuring and reporting Monte Carlo error is a vital component of a simulation study. **SimEngine** includes an option in the `summarize()` function to automatically estimate the Monte Carlo standard error for any inferential summary statistic, e.g., estimator bias or confidence interval coverage. The standard error estimates are based on the formulas provided in Morris et al. 2019. If the option `mc_se` is set to `TRUE`, estimates of Monte Carlo standard error will be included in the summary data frame, along with associated 95% confidence intervals based on a normal approximation.

```{r}
sim <- new_sim()
create_data <- function(n) { rpois(n, lambda=5) }
est_mean <- function(dat) {
  return(mean(dat))
}
sim %<>% set_levels(n=c(10,100,1000))
sim %<>% set_config(num_sim=5)
sim %<>% set_script(function() {
  dat <- create_data(L$n)
  lambda_hat <- est_mean(dat=dat)
  return (list("lambda_hat"=lambda_hat))
})
sim %<>% run()
sim %>% summarize(
  list(stat="mse", name="lambda_mse", estimate="lambda_hat", truth=5), 
  mc_se = TRUE
)
```