--- title: "aggregate-pvalue-sim" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{aggregate-pvalue-sim} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: '`r system.file("REFERENCES.bib", package = "ezECM")`' --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The purpose of this vignette is to illustrate how one would use the `pval_gen()` function to simulate p-values and then feed these p-values to the `cECM()` function to illustrate the performance of event categorization techniques. Alternatively, one could use this vignette to learn how to feed sets of real training p-values along with newly generated p-values to `cECM()`. First, let's load the package: ```{r setup} library(ezECM) ``` # Simulating p-values The `pval_gen()` function simulates p-values generated from the two discriminants of event depth and polarity of motion, for the event types of `"earthquake"` and `"explosion"`. Details for the methods for calculating the specific discriminant p-values are shown in @anderson2007mathematical. An example of a call to the `pval_gen()` function is shown below. ```{r pvalgen} simulated.pvalues <- pval_gen(sims = 160, grid.dim = c(1000,1000,30), seismometer = list(N = 100, max.depth = 2), explosion = list(max.depth = 3, prob = 0.3), pwave.arrival = list(H0 = 4, optim.starts = 15), first.polarity = list(read.err = 0.6)) ``` The call produces 160 sets of p-values, which will be split between training data and newly acquired data. Under the current version, it is advised to generate p-values for both these data sets with the same call to `pval_gen()` so that the same seismometer locations and variance in arrival times at these locations are consistent. Events and seismometers are located within a space that is 1000 km in X and Y coordinates, and has a depth of 30 km. `N = 50` seismometer locations, with a maximum depth of 2 km are randomly and uniformly distributed through the region. An `"explosion"` event has a maximum true depth of 2 km. When generating each set of p-values there is a probability of `prob = 0.2` that the set will be generated from an explosion. This `explosion$prob` parameter sets the ratio between events of `"explosion"` and `"earthquake"` within the data set, with some randomness within this ratio. The null hypothesis for testing event depth is set to greater or equal to 4 km. To reduce computation time of the simulation `optim.starts` is set to 15. This controls how exhaustive the search is for the gradient based estimation of the event location. A summary of the simulation can be inspected: ```{r pvalgen_summary} summary(simulated.pvalues) ``` The first column of the `simulated.pvalues` data frame contains a p-value related to estimated event depth, the second column contains the p-value related to polarity of first motion, and the third column contains the true event category. An example row of a single event is as follows: ```{r pvalgen_singlerow} simulated.pvalues[1,] ``` To show the functionality of the `cECM()` function, `simulated.pvalues` needs to be split into a training set and a set that would be used equivalent to if a new event which needed to be categorized occurred. In this case, we will use the last 10 sets of p-values generated. The p-values for `new.data` are tabulated after subsetting them below. ```{r pvalgen_traindf} train <- simulated.pvalues[1:150,] new.data <- simulated.pvalues[151:160,] knitr::kable(new.data, format = "html") ``` The `cECM()` function cannot use the `event` column of `new.data`. This column is saved in order to check the accuracy of the method, and then deleted from the `new.data` data frame. ```{r truecat} new.data.true <- new.data$event new.data$event <- NULL ``` We are now ready to use the `cECM()` function. # ECM First, we will use `cECM()` to fit a model to the training data, and then use this model fit to look at the evidence that the new data is from the distribution of p-values obtained from explosion data. It is possible to feed both the training data and the new data to the `cECM()` function simultaneously. However, the usage illustrated below is more realistic for an application where the model would first be trained and later be used. ```{r pagg_fit} fit <- cECM(x = train) ``` Then this object can be plotted: ```{r ecmplot} plot(fit) ``` This `fit` can be saved for later use. However, because we have some new data, we are ready to use this `fit` with the `new.data`. ```{r pagg_newdata} new.data.category <- cECM(x = fit, newdata = new.data) ``` We can then investigate the aggregate p-values for `new.data` belonging to the `"explosion"` distribution. Organizing and summarizing the output from `cECM()` allows for a comparison between the aggregate p-value and the true category. Ideally, the `explosion` column should have an aggregate p-value below the pre-determined threshold when the `new.data.true` column contains `"earthquake"`. ```{r pagg_summary} categories <- cbind(new.data.category$cECM, new.data.true) categories <- categories[c("explosion", "new.data.true")] knitr::kable(categories, format = "html") ``` # References