---
title: "aggregate-pvalue-sim"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{aggregate-pvalue-sim}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography:  '`r system.file("REFERENCES.bib", package = "ezECM")`'
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The purpose of this vignette is to illustrate how one would use the `pval_gen()` function to simulate p-values and then feed these p-values to the `cECM()` function to illustrate the performance of event categorization techniques.  Alternatively, one could use this vignette to learn how to feed sets of real training p-values along with newly generated p-values to `cECM()`.

First, let's load the package:

```{r setup}
library(ezECM)
```

# Simulating p-values

The `pval_gen()` function simulates p-values generated from the two discriminants of event depth and polarity of motion, for the event types of `"earthquake"` and `"explosion"`.  Details for the methods for calculating the specific discriminant p-values are shown in @anderson2007mathematical.

An example of a call to the `pval_gen()` function is shown below.

```{r pvalgen}
simulated.pvalues <- pval_gen(sims = 160, grid.dim = c(1000,1000,30), 
                              seismometer = list(N = 100, max.depth = 2),
                              explosion = list(max.depth = 3, prob = 0.3),
                              pwave.arrival = list(H0 = 4, optim.starts = 15),
                              first.polarity  = list(read.err = 0.6))
```

The call produces 160 sets of p-values, which will be split between training data and newly acquired data.  Under the current version, it is advised to generate p-values for both these data sets with the same call to `pval_gen()` so that the same seismometer locations and variance in arrival times at these locations are consistent.  Events and seismometers are located within a space that is 1000 km in X and Y coordinates, and has a depth of 30 km.  `N = 50` seismometer locations, with a maximum depth of 2 km are randomly and uniformly distributed through the region.  An `"explosion"` event has a maximum true depth of 2 km.  When generating each set of p-values there is a probability of `prob = 0.2` that the set will be generated from an explosion.  This `explosion$prob` parameter sets the ratio between events of `"explosion"` and `"earthquake"` within the data set, with some randomness within this ratio.  The null hypothesis for testing event depth is set to greater or equal to 4 km.  To reduce computation time of the simulation `optim.starts` is set to 15.  This controls how exhaustive the search is for the gradient based estimation of the event location.

A summary of the simulation can be inspected:

```{r pvalgen_summary}
summary(simulated.pvalues)
```

The first column of the `simulated.pvalues` data frame contains a p-value related to estimated event depth, the second column contains the p-value related to polarity of first motion, and the third column contains the true event category.  An example row of a single event is as follows:

```{r pvalgen_singlerow}
simulated.pvalues[1,]
```

To show the functionality of the `cECM()` function, `simulated.pvalues` needs to be split into a training set and a set that would be used equivalent to if a new event which needed to be categorized occurred.  In this case, we will use the last 10 sets of p-values generated.  The p-values for `new.data` are tabulated after subsetting them below.

```{r pvalgen_traindf}
train <- simulated.pvalues[1:150,]
new.data <- simulated.pvalues[151:160,]

knitr::kable(new.data, format = "html")
```

The `cECM()` function cannot use the `event` column of `new.data`.  This column is saved in order to check the accuracy of the method, and then deleted from the `new.data` data frame.

```{r truecat}
new.data.true <- new.data$event
new.data$event <- NULL
```

We are now ready to use the `cECM()` function.

# ECM

First, we will use `cECM()` to fit a model to the training data, and then use this model fit to look at the evidence that the new data is from the distribution of p-values obtained from explosion data.  It is possible to feed both the training data and the new data to the `cECM()` function simultaneously.  However, the usage illustrated below is more realistic for an application where the model would first be trained and later be used.

```{r pagg_fit}
fit <- cECM(x = train)
```

Then this object can be plotted:

```{r ecmplot}
plot(fit)
```

This `fit` can be saved for later use.  However, because we have some new data, we are ready to use this `fit` with the `new.data`.

```{r pagg_newdata}
new.data.category <- cECM(x = fit, newdata = new.data)
```

We can then investigate the aggregate p-values for `new.data` belonging to the `"explosion"` distribution.  Organizing and summarizing the output from `cECM()` allows for a comparison between the aggregate p-value and the true category.  Ideally, the `explosion` column should have an aggregate p-value below the pre-determined threshold when the `new.data.true` column contains `"earthquake"`.

```{r pagg_summary}
categories <- cbind(new.data.category$cECM, new.data.true)
categories <- categories[c("explosion", "new.data.true")]

knitr::kable(categories, format = "html")
```


#  References