Seek computer programs that allow you to do the thinking. – George E. P. Box

Whether you are interested in evaluating the performance of a new optimizer or estimation criteria, re-evaluating previous research claims (e.g., ANOVA is ‘robust’ to violations of normality), determine power rates for an upcoming research proposal, or simply to appease a strange thought in your head about a new statistical idea you heard about, designing Monte Carlo simulations can be incredibly rewarding and are extremely important to those who are statistically oriented. However, organizing simulations can be a challenge, particularly to those new to the topic, where all too often coders resort to the inefficient and error prone strategies (e.g., the dreaded “for-loop” strategy, *for*-ever resulting in confusing, error prone, and simulation specific code). The package `SimDesign`

is one attempt to fix these and other issues that often arise when designing Monte Carlo simulation experiments, while also providing a templated setup that is designed to support many useful features that can be useful when evaluating simulation research.

Generally speaking, Monte Carlo simulations can be broken into three major components:

**generate**your data from some model/probability density function given various**design**conditions to be studied (e.g., sample size, distributions, group sizes, etc),**analyse**the generated data using whatever statistical analyses you are interested in (e.g., \(t\)-test, ANOVA, SEMs, IRT, etc), and collect the statistics/CIs/\(p\)-values/parameter estimates you are interested in, and**summarise**the results after repeating the simulations \(R\) number of times to obtain empirical estimates of the population’s behavior.

Each operation above represents the essential components of the `SimDesign`

package. The **design** component is represented by a `tibble`

-like object containing the simulation conditions to be investigated, while **generate**, **analyse**, and **summarise** represent user-defined functions which comprise the three steps in the simulation. Each of these components are constructed and passed to the `runSimulation()`

function where the simulation steps are evaluated, ultimately returning a `tibble`

-like object containing the simulation results.

After loading the `SimDesign`

package, we begin by defining the required user-constructed functions. To expedite this process, a call to `SimFunctions()`

will create a template to be filled in, where all the necessary functional arguments have been pre-assigned, and only the body of the functions need to be modified. The documentation of each argument can be found in the respective R help files, however there organization is very simple conceptually.

To begin, the following code should be copied and saved to an external source (i.e., text) file.

```
library(SimDesign)
SimFunctions()
```

```
#-------------------------------------------------------------------
library(SimDesign)
Design <- createDesign(factor1 = NA,
factor2 = NA)
#-------------------------------------------------------------------
Generate <- function(condition, fixed_objects = NULL) {
dat <- data.frame()
dat
}
Analyse <- function(condition, dat, fixed_objects = NULL) {
ret <- nc(stat1 = NaN, stat2 = NaN)
ret
}
Summarise <- function(condition, results, fixed_objects = NULL) {
ret <- c(bias = NaN, RMSE = NaN)
ret
}
#-------------------------------------------------------------------
res <- runSimulation(design=Design, replications=1000, generate=Generate,
analyse=Analyse, summarise=Summarise)
res
```

Alternatively, if you are lazy (read: efficient) or just don’t like copy-and-pasting, `SimFunctions()`

can write the output to a file by providing a `filename`

argument. The following creates a file (`mysim.R`

) containing the simulation design/execution and required user-defined functions.

`SimDesign::SimFunctions('mysim')`

For larger simulations, you may want to use two files, and if you’d prefer to have helpful comments included then these can be achieved with the `singlefile`

and `comments`

arguments, respectively.

`SimFunctions('mysim', singlefile = FALSE, comments = TRUE)`

The choice of using a single file or not is entirely a matter of preference, and will not influence the overall simulation implementation. However, should you wish to include separate analysis or generate functions the arguments `nGenerate`

and `nAnalyse`

can be useful to compartmentalize generally distinct portions of the code (e.g., one analyse function for fitting and extracting components of a structural equation model, and one analyse function for fitting and extracting information form an item response theory model).

As a toy example, let’s consider how the following question can be investigated with `SimDesign`

:

*Question*: How does trimming affect recovering the mean of a distribution? Investigate this using different sample sizes with Gaussian and \(\chi^2\) distributions. Also, demonstrate the effect of using the median to recover the mean.

First, define the condition combinations that should be investigated. In this case we wish to study 4 different sample sizes, and use a symmetric and skewed distribution. The use of `createDesign()`

is extremely helpful here to create a completely crossed-design for each combination (there are 8 in total).

```
Design <- createDesign(sample_size = c(30, 60, 120, 240),
distribution = c('norm', 'chi'))
Design
```

```
## # A tibble: 8 × 2
## sample_size distribution
## <dbl> <chr>
## 1 30 norm
## 2 60 norm
## 3 120 norm
## 4 240 norm
## 5 30 chi
## 6 60 chi
## 7 120 chi
## 8 240 chi
```

Each row in `Design`

represents a unique condition to be studied in the simulation. In this case, the first condition to be studied comes from row 1, where \(N=30\) and the distribution should be normal.

We first start by defining the data generation functional component. The only argument accepted by this function is `condition`

, which will always be a *single row from the Design data.frame object* of class `data.frame`

. Conditions are run sequentially from row 1 to the last row in `Design`

. It is also possible to pass a `fixed_objects`

object to the function for including fixed sets of population parameters and other conditions, however for this simple simulation this input is not required.

```
Generate <- function(condition, fixed_objects = NULL) {
N <- condition$sample_size
dist <- condition$distribution
if(dist == 'norm'){
dat <- rnorm(N, mean = 3)
} else if(dist == 'chi'){
dat <- rchisq(N, df = 3)
}
dat
}
```

As we can see above, `Generate()`

will return a numeric vector of length \(N\) containing the data to be analysed each with a population mean of 3 (because a \(\chi^2\) distribution has a mean equal to its df). Next, we define the `analyse`

component to analyse said data:

```
Analyse <- function(condition, dat, fixed_objects = NULL) {
M0 <- mean(dat)
M1 <- mean(dat, trim = .1)
M2 <- mean(dat, trim = .2)
med <- median(dat)
ret <- c(mean_no_trim=M0, mean_trim.1=M1, mean_trim.2=M2, median=med)
ret
}
```

This function accepts the data previously returned from `Generate()`

(`dat`

), the condition vector previously mentioned.

At this point, we may conceptually think of the first two functions as being evaluated independently \(R\) times to obtain \(R\) sets of results. In other words, if we wanted the number of replications to be 100, the first two functions would be independently run (at least) 100 times, the results from `Analyse()`

would be stored, and we would then need to summarise these 100 elements into meaningful meta statistics to describe their empirical properties. This is where computing meta-statistics such as bias, root mean-square error, detection rates, and so on are of primary importance. Unsurprisingly, then, this is the purpose of the `summarise`

component:

```
Summarise <- function(condition, results, fixed_objects = NULL) {
obs_bias <- bias(results, parameter = 3)
obs_RMSE <- RMSE(results, parameter = 3)
ret <- c(bias=obs_bias, RMSE=obs_RMSE, RE=RE(obs_RMSE))
ret
}
```

Again, `condition`

is the same as was defined before, while `results`

is a `matrix`

containing all the results from `Analyse()`

, where each row represents the result returned from each respective replication, and the number of columns is equal to the length of a single vector returned by `Analyse()`

.

That sounds much more complicated than it is — all you really need to know for this simulation is that an \(R\) x 4 matrix called `results`

is available to build a suitable summary from. Because the results is a matrix, `apply()`

is useful to apply a function over each respective row. The bias and RMSE are obtained for each respective statistic, and the overall result is returned as a vector.

Stopping for a moment and thinking carefully, we know that each `condition`

will be paired with a unique vector returned from `Summarise()`

. Therefore, you might be thinking that the result returned from the simulation will be in a rectangular form, such as in a `matrix`

, `data.frame`

, or `tibble`

. Well, you’d be right!

The last stage of the `SimDesign`

work-flow is to pass the four defined elements to the `runSimulation()`

function which, unsurprisingly given it’s name, runs the simulation. There are numerous options available in the function, and these should be investigated by reading the `help(runSimulation)`

HTML file. Options for performing simulations in parallel, storing/resuming temporary results, debugging functions, and so on are available. Below we simply request that each condition be run 1000 times on a single processor, and finally store the results to an object called `results`

.

```
res <- runSimulation(Design, replications = 1000, generate=Generate,
analyse=Analyse, summarise=Summarise)
```

```
##
##
Design row: 1/8; RAM used: 57.5 Mb; Total elapsed time: 0.00s
## Conditions: sample_size=30, distribution=norm
##
##
Design row: 2/8; RAM used: 57.6 Mb; Total elapsed time: 0.30s
## Conditions: sample_size=60, distribution=norm
##
##
Design row: 3/8; RAM used: 57.6 Mb; Total elapsed time: 0.62s
## Conditions: sample_size=120, distribution=norm
##
##
Design row: 4/8; RAM used: 57.6 Mb; Total elapsed time: 0.93s
## Conditions: sample_size=240, distribution=norm
##
##
Design row: 5/8; RAM used: 57.6 Mb; Total elapsed time: 1.27s
## Conditions: sample_size=30, distribution=chi
##
##
Design row: 6/8; RAM used: 57.6 Mb; Total elapsed time: 1.58s
## Conditions: sample_size=60, distribution=chi
##
##
Design row: 7/8; RAM used: 57.6 Mb; Total elapsed time: 1.90s
## Conditions: sample_size=120, distribution=chi
##
##
Design row: 8/8; RAM used: 57.6 Mb; Total elapsed time: 2.23s
## Conditions: sample_size=240, distribution=chi
##
```

`res`

```
## # A tibble: 8 × 19
## sample…¹ distr…² bias.mea…³ bias.mea…⁴ bias.mea…⁵ bias.med…⁶ RMSE.m…⁷ RMSE.m…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 30 norm -4.8080e-3 -4.6834e-3 -2.9265e-3 4.8276e-3 0.18539 0.18976
## 2 60 norm 3.0546e-3 3.9219e-3 3.1080e-3 5.9586e-3 0.12524 0.12865
## 3 120 norm 1.1275e-3 7.8027e-4 5.7227e-4 -3.2579e-4 0.090759 0.093936
## 4 240 norm 9.5261e-4 1.1309e-3 7.0515e-4 -1.5759e-4 0.064267 0.065862
## 5 30 chi 2.2222e-2 -2.9863e-1 -4.4355e-1 -5.7704e-1 0.44753 0.51447
## 6 60 chi -4.2835e-3 -3.4631e-1 -4.9384e-1 -6.3638e-1 0.31925 0.45470
## 7 120 chi -5.0244e-3 -3.4566e-1 -4.8818e-1 -6.3533e-1 0.22368 0.40627
## 8 240 chi -5.8300e-5 -3.4650e-1 -4.8943e-1 -6.3179e-1 0.15787 0.37689
## # … with 11 more variables: RMSE.mean_trim.2 <dbl>, RMSE.median <dbl>,
## # RE.mean_no_trim <dbl>, RE.mean_trim.1 <dbl>, RE.mean_trim.2 <dbl>,
## # RE.median <dbl>, REPLICATIONS <int>, SIM_TIME <chr>, RAM_USED <chr>,
## # SEED <int>, COMPLETED <chr>, and abbreviated variable names ¹sample_size,
## # ²distribution, ³bias.mean_no_trim, ⁴bias.mean_trim.1, ⁵bias.mean_trim.2,
## # ⁶bias.median, ⁷RMSE.mean_no_trim, ⁸RMSE.mean_trim.1
## # ℹ Use `colnames()` to see all variable names
```

As can be seen from the printed results, each result from the `Summarise()`

function has been paired with their respective condition, meta-statistics have been properly named, and three additional columns have been appended to the results: `REPLICATIONS`

, which indicates how many time the conditions were performed, `SIM_TIME`

, indicating the time (in seconds) it took to completely finish the respective conditions, and `SEED`

, which indicates the random seeds used by `SimDesign`

for each condition (for reproducibility). A call to `View()`

in the R console may also be a nice way to sift through the `res`

object.

In this case, visually inspecting the simulation table is enough to understand what is occurring, though for other Monte Carlo simulations use of ANOVAs, marginalized tables, and graphics should be used to capture the essentially phenomenon in the results. Monte Carlo simulations are just like collecting and analysing data for experiments, so my advice would be to put on your analysis hats and present your data as though it were data collected from the real world.

In this particular simulation, it is readily apparent that using the un-adjusted mean will adequately recover the population mean with little bias. The precision also seems to increase as sample sizes increase, which is indicated by the decreasing RMSE statistics. Generally, trimming causes less efficiency in the estimates, where greater amounts of trimming results in even less efficiency, and using the median as a proxy to estimate the mean is the least effective method. This is witnessed rather clearly in the following table, which prints the relative efficiency of the estimators:

```
REs <- res[,grepl('RE\\.', colnames(res))]
data.frame(Design, REs)
```

```
## sample_size distribution RE.mean_no_trim RE.mean_trim.1 RE.mean_trim.2
## 1 30 norm 1 1.0 1.1
## 2 60 norm 1 1.1 1.1
## 3 120 norm 1 1.1 1.2
## 4 240 norm 1 1.1 1.1
## 5 30 chi 1 1.3 1.9
## 6 60 chi 1 2.0 3.2
## 7 120 chi 1 3.3 5.7
## 8 240 chi 1 5.7 10.5
## RE.median
## 1 1.4
## 2 1.6
## 3 1.5
## 4 1.6
## 5 2.8
## 6 5.1
## 7 9.2
## 8 17.1
```

Finally, when the \(\chi^2\) distribution was investigated only the un-adjusted mean accurately portrayed the population mean. This isn’t surprising, because the trimmed mean is, after all, making inferences about the population trimmed mean, and the median is making inferences about, well, the median. Only when the distributions under investigation are symmetric are the statistics able to draw inferences about the same inferences about the mean of the population.

The following is a conceptual breakdown of what `runSimulation()`

is actually doing behind the scenes. Here we demonstrate the results from the first condition (row 1 of `Design`

) to show what each function returns.

A single replication in a Monte Carlo simulation results in the following objects:

`(condition <- Design[1, ])`

```
## # A tibble: 1 × 2
## sample_size distribution
## <dbl> <chr>
## 1 30 norm
```

```
dat <- Generate(condition)
dat
```

```
## [1] 2.37 3.18 2.16 4.60 3.33 2.18 3.49 3.74 3.58 2.69 4.51 3.39 2.38 0.79 4.12
## [16] 2.96 2.98 3.94 3.82 3.59 3.92 3.78 3.07 1.01 3.62 2.94 2.84 1.53 2.52 3.42
```

```
res <- Analyse(condition, dat)
res
```

```
## mean_no_trim mean_trim.1 mean_trim.2 median
## 3.1 3.2 3.2 3.3
```

We can see that `Generate()`

returns a `numeric`

vector which is accepted by `Analyse()`

. The `Analyse()`

function then completes the analysis portion using the generated data, and returns a named vector with the observed parameter estimates. Of course, this is only a single replication, and therefore is not really meaningful in the grand scheme of things; so, it must be repeated a number of times.

```
# repeat 1000x
results <- matrix(0, 1000, 4)
colnames(results) <- names(res)
for(i in 1:1000){
dat <- Generate(condition)
res <- Analyse(condition, dat)
results[i, ] <- res
}
head(results)
```

```
## mean_no_trim mean_trim.1 mean_trim.2 median
## [1,] 3.1 3.1 3.1 2.9
## [2,] 3.1 3.1 3.1 3.1
## [3,] 3.1 3.1 3.0 2.8
## [4,] 2.7 2.6 2.6 2.7
## [5,] 3.2 3.2 3.2 3.0
## [6,] 3.1 3.1 3.0 3.1
```

The matrix stored in `results`

contains 1000 parameter estimates returned from each statistic. After this is obtained, we can move on to summarising the output through the `Summarise()`

function to obtain average estimates, their associated sampling error, their efficiency, and so on.

`Summarise(condition, results) `

```
## bias.mean_no_trim bias.mean_trim.1 bias.mean_trim.2 bias.median
## -0.0011 -0.0031 -0.0035 -0.0037
## RMSE.mean_no_trim RMSE.mean_trim.1 RMSE.mean_trim.2 RMSE.median
## 0.1739 0.1777 0.1859 0.2146
## RE.mean_no_trim RE.mean_trim.1 RE.mean_trim.2 RE.median
## 1.0000 1.0442 1.1425 1.5225
```

This process is then repeated for each row in the `Design`

object until the entire simulation study is complete.

Of course, `runSimulation()`

does much more than this conceptual outline, which is why it exists. Namely, errors and warnings are controlled and tracked, data is re-drawn when needed, parallel processing is supported, debugging is easier with the `debug`

input (or by inserting `browser()`

directly), temporary and full results can be saved to external files, the simulation state can be saved/restored, build-in safety features are included, and more. The point, however, is that you as the user *should not be bogged down with the nitty-gritty details of setting up the simulation work-flow/features*; instead, you should be focusing your time on the important generate-analyse-summarise steps, organized in the body of the above functions, that are required to obtain your interesting simulation results. After all, the point designing a computer simulation experiment is to understand the resulting output, not to become a master of all aspects of your select computing language pertaining to object storage, parallel processing, RAM storage, defensive coding, progress reporting, reproducibility, post-processing, …, ad nauseam.

To access further examples and instructions feel free to visit the package wiki on Github