The CCI package implements computational tests for conditional independence as proposed in Thorjussen et al. (2024). To test the hypothesis \(Y \ind X \mid Z\), the method evaluates whether \(X\) provides additional out-of-sample predictive information about \(Y\) beyond \(Z\). Under conditional independence, no such improvement is expected. Formally, under the null hypothesis, a predictive model \(f_Y(x,z) = \mathbf{E}[Y \mid X=x, Z=z]\) should perform equivalently when \(X\) is replaced by an independent permutation \(X^*\), i.e. \[ \mathbf{E}[Y \mid X=x, Z=z] = \mathbf{E}[Y \mid X^*=x, Z=z]. \] Predictive performance metrics (e.g. RMSE) are used to compare models based on \(X\) and \(X^*\), yielding a test of conditional independence. \[ H_0: Y \perp X | Z \Rightarrow M(f_Y(X^*, Z)) = M(f_Y(X, Z)), \\ H_1: Y \not\perp X | Z \Rightarrow M(f_Y(X^*, Z)) \neq M(f_Y(X, Z)). \] Where \(M\) is a performance metric such as RMSE.
In order to compare \[\mathbf{E}[Y \mid X^*=x, Z=z]\] with \[\mathbf{E}[Y \mid X=x, Z=z]\], CCI builds a null distribution by Monte Carlo cross validation. Monte Carlo cross validation repeatedly splits the data into training and test sets, fits the model on the training set, and evaluates the performance on the test set. By permuting \(X\) multiple times, CCI generates a distribution of performance metrics under the null hypothesis of conditional independence. The observed performance metric is then compared to this null distribution to compute a p-value.
The CCI test is an indirect test, which replaces the problem of testing conditional independence with the problem of evaluating predictive performance. The CCI approach achieves two main goals, firstly it avoids difficult assumptions when testing conditional independence using permutations, such as exchangeability and assymptotic behaviour of test statistics; which makes it robust in terms of type I error control. Secondly, it facilitates a general testing framework for various different data types and structures, for instance testing conditional independence between two categorical variables. However, the downside of this approach is that it requires more computational resources and in many cases have a long runtime compared to most other CI tests. It can have moderate or low statistical power in small samples, particularly in complex data and the results can be sensitive to the choice of machine learning method and hyperparameters.
The CCI testing relies mainly in tree based machine learning models to construct the null distribution. Tree based models are flexible and can capture complex relationships in the data, making them suitable for a wide range of applications. And they are proven to be robust in predicting out-of-sample, even though trained deeply on the train data.
Throught these examples, we use simulated data where the ground truth
is known. ## how to run a basic conditional independence test using CCI
The function below generates a data set where Y is
conditionally independent of X given Z1 and
Z2. The functions are linear so it should be easy to test
conditional independence using any statistical test for conditional
independence.
NormalData <- function(N){
Z1 <- stats::rnorm(N,0,1)
Z2 <- stats::rnorm(N,0,1)
X <- stats::rnorm(N, Z1 + Z2, 1)
Y <- stats::rnorm(N, Z1 + Z2, 1)
df <- data.frame(Z1, Z2, X, Y)
return(df)
}The CCI.test() function is the main testing function,
required inputs are formula and data. If we
want to test the hypothesis Y _||_ X | Z1, Z2, this is
written as formula = Y ~ X | Z1 + Z2.
set.seed(123)
data <- NormalData(400)
simple_test <- CCI.test(formula = Y ~ X | Z1 + Z2, data = data, seed = 42) # Setting seed for reproducibility
simple_test
#>
#>
#>
#> data:
#> p-value = 0.1242The results produce a p-value of 0.81; we do not reject the null. If
we remove Z1 or Z2 from the conditioning set,
we should reject the null hypothesis since Y and
X are dependent given only one of the Z
variables. Now we wrapp the output inside a summary()
function to get a more detailed output.
fail_test <- CCI.test(formula = Y ~ X | Z1, data = data, seed = 42)
summary(fail_test)
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using rf
#> Formula: Y ~ X | Z1
#> Permutations: 160
#> Metric: RMSE
#> Tail: left
#> Statistic: 1.201
#> P-value: 0.006211
#>
#> Subsample: 1The detailed output gives us the method of testing, which is
CCI test using rf, meaning we are using the CCI method with
a random forest learner. The formula states that we are testing the
condition Y ~ X | Z1. Permutations indicate the number of
Monte Carlo samples which where used to create the null distribution
(160 is default). The perfromance metric \(M\) is RMSE, which is the default metric
for continuous outcomes. Tail: left means that when the
test statistic is to the left of the bulk of the null distribution
(lower RMSE is better), then it means that including \(X\) improves the prediction of \(Y\) beyond \(X^*\) and \(Z1\). The value of the test statisic is
1.317 and the calculated p-value is 0.006211. Note that the calculate
p-value is empirical, and has the lowest possible value, i.e. 1/161 =
0.006211. Subsample: 1 means that the whole sample where
used in each Monte Carlo run to produce the null distribution.
If we want to use xgboost instead of random forest, we
can set the argument method = "xgboost". We also set some
other arguments, nperm = 250 means that instead of the
default 160 Monte Carlo iterations, we now perform 250, this will give a
more precises p-value. parametric = TRUE means that the
p-value is calculated assuming the null distribution is Gaussian. The
nrounds = 400 argument sets the number of rounds (trees)
for methods xgboost and rf (default is 600).
We also pass in some hyperparameter settings for the XGBoost
algorithm.
summary(CCI.test(formula = Y ~ X | Z1,
data = data,
method = 'xgboost',
nperm = 250,
parametric = TRUE,
nrounds = 400,
min_child_weight = 2,
colsample_bytree = 2,
eta = 0.1,
max_depth = 10,
seed = 42))
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using xgboost
#> Formula: Y ~ X | Z1
#> Permutations: 250
#> Metric: RMSE
#> Tail: left
#> Statistic: 1.394
#> P-value: 8.384e-07
#>
#> Subsample: 1The null distribution can be visualized using the plot
function.
set.seed(13)
dat <- NormalData(500)
cci <- CCI.test(Y ~ X | Z1 + Z2, data = dat, seed = 1)
plot(cci)Let’s generate some difficult data.
HardCase <- function(N) {
Z1 <- stats::runif(N, -2, 2)
Z2 <- stats::runif(N, -2, 2)
hZ <- sin(Z1) * cos(Z2)
X <- hZ + 0.2 * stats::rnorm(N)
Y <- hZ^2 + 0.2 * stats::rnorm(N)
data.frame(X, Y, Z1, Z2)
}Even with 5000 samples, the default settings in CCI starts to become slow.
set.seed(1)
data <- HardCase(5000)
tic <- Sys.time()
summary(CCI.test(Y ~ X | Z1 + Z2, data = data))
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using rf
#> Formula: Y ~ X | Z1 + Z2
#> Permutations: 160
#> Metric: RMSE
#> Tail: left
#> Statistic: 0.2104
#> P-value: 0.5093
#>
#> Subsample: 0.28
toc <- Sys.time()
toc - tic
#> Time difference of 29.52549 secsThe most efficient way of speeding up the runtime is the use the
method KNN.
tic <- Sys.time()
summary(CCI.test(Y ~ X | Z1 + Z2, data = data, method = "KNN"))
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using KNN
#> Formula: Y ~ X | Z1 + Z2
#> Permutations: 160
#> Metric: RMSE
#> Tail: left
#> Statistic: 0.2066
#> P-value: 0.06211
#>
#> Subsample: 0.28
toc <- Sys.time()
toc - tic
#> Time difference of 3.07964 secsHowever, the KNN method is not as robust as the tree based methods.
If you notice the summary output above, the subsample line is now
changed to 0.3, meaning that when testing conditional independence, only
30% of the data is used in each Monte Carlo iteration, however, it is a
different 30 % for each iteration. This is automatically set to speed up
the testing runtime. However, when using KNN it is often
not necessary unless the sample is very large. In the example below we
set the subsample argument to "No", meaning no
sub-sampling will occur (except what’s inside the learner default
settings).
tic <- Sys.time()
summary(CCI.test(Y ~ X | Z1 + Z2, data = data, method = "KNN", subsample = "No"))
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using KNN
#> Formula: Y ~ X | Z1 + Z2
#> Permutations: 160
#> Metric: RMSE
#> Tail: left
#> Statistic: 0.213
#> P-value: 0.9006
#>
#> Subsample: 1
toc <- Sys.time()
toc - tic
#> Time difference of 19.20926 secsIn the last example, test with 50000 observations, using
KNN with default subsample still gives a fast runtime.
set.seed(1)
data <- HardCase(50000)
tic <- Sys.time()
summary(CCI.test(Y ~ X | Z1, data = data, method = "KNN"))
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using KNN
#> Formula: Y ~ X | Z1
#> Permutations: 160
#> Metric: RMSE
#> Tail: left
#> Statistic: 0.2517
#> P-value: 0.006211
#>
#> Subsample: 0.049
toc <- Sys.time()
toc - tic
#> Time difference of 5.067082 secsCCI automatically adapts to the outcome type. For continuous outcomes, it uses regression methods and metrics RMSE; for categorical outcomes, it uses classification methods and metrics Kappa.
ExpLogThreshold <- function(N) {
Z1 <- stats::rnorm(N)
Z2 <- stats::rnorm(N)
X <- exp(Z1) + Z2 + stats::rnorm(N, 0, 0.2)
Y <- ifelse(log(abs(Z1) + 1) + Z2 > 0.5, "Goblin",
ifelse(log(abs(Z1) + 1) + Z2 > 0, "Orc",
ifelse(log(abs(Z1) + 1) > -0.5, "Troll", "Elf")))
return(data.frame(Z1, Z2, X, Y))
}By setting the outcome Y as a factor, CCI recognizes it
as a classification problem and automatically switches to classification
methods and metrics.
set.seed(123)
dat <- ExpLogThreshold(500)
dat$Y <- as.factor(dat$Y)
summary(CCI.test(Y ~ X | Z1 + Z2, data = dat))
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using rf
#> Formula: Y ~ X | Z1 + Z2
#> Permutations: 160
#> Metric: Kappa
#> Tail: right
#> Statistic: 0.8883
#> P-value: 0.4907
#>
#> Subsample: 1
summary(CCI.test(Y ~ X | Z1, data = dat))
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using rf
#> Formula: Y ~ X | Z1
#> Permutations: 160
#> Metric: Kappa
#> Tail: right
#> Statistic: 0.6934
#> P-value: 0.006211
#>
#> Subsample: 1The user can also manually specify the metric to use via the
metric argument.
set.seed(123)
dat <- ExpLogThreshold(500)
dat$Y <- as.numeric(as.factor(dat$Y))
summary(CCI.test(Y ~ X | Z1 + Z2, data = dat, metric = "Kappa"))
#>
#> Computational Conditional Independence Test
#> --------------------------------------------
#> Method: CCI test using rf
#> Formula: Y ~ X | Z1 + Z2
#> Permutations: 160
#> Metric: Kappa
#> Tail: right
#> Statistic: -0.386
#> P-value: 0.4783
#>
#> Subsample: 1The dashed line indicates the observed test statistic, while the histogram represents the null distribution generated through permutations.