---
title: "Causal Conditional Distance Correlation"
author: "Eric W. Bridgeford"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{cb.detect.caus_cdcorr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, message=FALSE}
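# library() stops with an error if a package is missing,
# whereas require() only returns FALSE with a warning
library(causalBatch)
library(ggplot2)
# tidyr re-exports the %>% pipe used in the plotting code
library(tidyr)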
n = 200
```

To start, we will work with a simulated example similar to those in the simulations vignette, which you can access with:

```{r, eval=FALSE}
vignette("cb.simulations", package="causalBatch")
```

First, we define a helper function for plotting the simulated data:

```{r}
# a function for plotting a scatter plot of the data
plot.sim <- function(Ys, Ts, Xs, title="", 
                     xlabel="Covariate",
                     ylabel="Outcome (1st dimension)") {
  data = data.frame(Y1=Ys[,1], Y2=Ys[,2], 
                    Group=factor(Ts, levels=c(0, 1), ordered=TRUE), 
                    Covariates=Xs)
  
  data %>%
    ggplot(aes(x=Covariates, y=Y1, color=Group)) +
    geom_point() +
    labs(title=title, x=xlabel, y=ylabel) +
    scale_x_continuous(limits = c(-1, 1)) +
    scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"), 
                       name="Group/Batch") +
    theme_bw()
}
```

Next, we will generate a simulation:

```{r, fig.width=5, fig.height=3}
sim = cb.sims.sim_sigmoid(n=n, eff_sz=1, unbalancedness=1.5)

plot.sim(sim$Ys, sim$Ts, sim$Xs, title="Sigmoidal Simulation")
```
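
The plot colors points by group, so the covariate overlap can be judged by eye. A quick numeric check can complement it. The sketch below uses stand-in data (the names `Xs.toy` and `Ts.toy` are hypothetical and are not the simulation output above) and computes the interval on which both groups have covariate values, along with the fraction of samples that fall inside it:

```{r}
set.seed(1)
# stand-in data (an assumption, not the simulation above): covariates in
# [-1, 1], skewed in opposite directions for the two groups
Ts.toy <- rep(c(0, 1), each = 100)
Xs.toy <- c(2 * rbeta(100, 2, 5) - 1, 2 * rbeta(100, 5, 2) - 1)

# covariate range within each group, as a named list ("0" and "1")
ranges <- lapply(split(Xs.toy, Ts.toy), range)

# common support: from the larger of the minima to the smaller of the maxima
overlap <- c(lower = max(ranges[["0"]][1], ranges[["1"]][1]),
             upper = min(ranges[["0"]][2], ranges[["1"]][2]))

# fraction of all samples whose covariate lies in the common support
frac.in.overlap <- mean(Xs.toy >= overlap[["lower"]] & Xs.toy <= overlap[["upper"]])
```

The same steps could be applied to `sim$Xs` and `sim$Ts` in place of the stand-in vectors.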

Although the covariate distributions of the two groups/batches do not overlap perfectly (note that `unbalancedness` is not $1$), the outcomes of the two batches still appear to differ slightly. We can test this using the causal conditional distance correlation, like so:

```{r}
result <- cb.detect.caus_cdcorr(sim$Ys, sim$Ts, sim$Xs, R=100)
```

Here, we set the number of null replicates `R` to $100$ so that the example runs quickly; in practice, you should typically use at least $1000$ null replicates. To speed up the computation, we suggest setting `num.threads` close to the number of cores available on your machine, which you can identify using `parallel::detectCores()`.
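
One way to choose the thread count is sketched below. Leaving one core free is a convention assumed here, not a requirement of the package, and `n.cores` is a name introduced only for illustration:

```{r}
# number of logical cores; detectCores() may return NA on some platforms
n.cores <- parallel::detectCores()
if (is.na(n.cores)) n.cores <- 1

# leave one core free for the rest of the system, but use at least one thread
num.threads <- max(1, n.cores - 1)
num.threads
```

The resulting value could then be supplied as `num.threads=num.threads` in the call to `cb.detect.caus_cdcorr`.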

At a significance level of $\alpha = 0.05$, the $p$-value of the test is:

```{r}
print(sprintf("p-value: %.4f", result$Test$p.value))
```

Since the $p$-value is less than $\alpha$, we reject the null hypothesis in favor of the alternative: that the group/batch causes a difference in the outcome variable.
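
The number of null replicates also limits how finely a $p$-value can be resolved. The sketch below illustrates this with a generic permutation test on a difference in group means. It is a toy stand-in, not the conditional distance correlation statistic and not necessarily how the package computes its $p$-value; the add-one correction and the names `stat` and `perm.pvalue` are assumptions made for illustration:

```{r}
set.seed(1)
# toy data: two groups of 50 samples whose means differ by 1
y <- c(rnorm(50, mean = 0), rnorm(50, mean = 1))
g <- rep(c(0, 1), each = 50)

# test statistic: absolute difference in group means
stat <- function(y, g) abs(mean(y[g == 1]) - mean(y[g == 0]))

perm.pvalue <- function(y, g, R) {
  observed <- stat(y, g)
  # each null replicate recomputes the statistic after shuffling the labels
  null <- replicate(R, stat(y, sample(g)))
  # add-one correction, so the estimate is never exactly zero
  (1 + sum(null >= observed)) / (R + 1)
}

p.100 <- perm.pvalue(y, g, R = 100)
p.1000 <- perm.pvalue(y, g, R = 1000)

# smallest p-values attainable under this convention: 1 / (R + 1)
min.p <- c(R.100 = 1 / 101, R.1000 = 1 / 1001)
```

Under this convention, $100$ replicates cannot produce a $p$-value below $1/101$, which is one reason to prefer more replicates when precision matters.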

We could optionally have pre-computed a distance matrix for the outcomes, like so:

```{r}
# compute distance matrix for outcomes
DY = dist(sim$Ys)
```

In your own use cases, you could replace `dist` with any distance function of your choosing and pass the resulting distance matrix directly to the detection algorithm by specifying `distance=TRUE`:

```{r}
result <- cb.detect.caus_cdcorr(DY, sim$Ts, sim$Xs, distance=TRUE, R=100)
```
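
As a sketch of what an alternative distance might look like, the block below builds two `dist` objects from a stand-in outcome matrix (`Y.toy` is hypothetical and is not the simulation output). The first uses the `method` argument of `stats::dist`; the second converts a custom dissimilarity matrix with `as.dist`. Whether a particular dissimilarity is appropriate for the test is an assumption you would need to check for your own data:

```{r}
set.seed(1)
# stand-in outcome matrix: 10 samples in 3 dimensions
Y.toy <- matrix(rnorm(30), nrow = 10, ncol = 3)

# built-in alternative: Manhattan (L1) distance between rows
DY.manhattan <- dist(Y.toy, method = "manhattan")

# custom alternative: 1 - Pearson correlation between rows,
# converted from a square matrix to a dist object
DY.corr <- as.dist(1 - cor(t(Y.toy)))
```

Either object could then take the place of `DY` in the call above, together with `distance=TRUE`.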