[R-meta] Descriptive Multilevel Meta-analysis with C-Statistics
Mika Manninen
Mon May 19 12:31:33 CEST 2025
Dear meta-analysis community,
We aim to explore whether, on average across various settings and
samples, Random Forests, XGBoost, and Logistic Regression differ in
their ability to discriminate the same medical outcome. In our example
dataset, the c-statistics (AUCs) have been logit-transformed (yi) and
are accompanied by their variances (vi).
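(For reference, yi and vi of this kind can be obtained from an AUC and its
standard error via the delta method; the sketch below is illustrative only and
uses hypothetical column names auc and auc_se.)

# delta-method sketch ('auc' and 'auc_se' are hypothetical columns)
dat <- data.frame(auc = c(0.70, 0.75), auc_se = c(0.03, 0.02))
dat$yi <- with(dat, log(auc / (1 - auc)))           # logit-transformed c-statistic
dat$vi <- with(dat, auc_se^2 / (auc * (1 - auc))^2) # variance on the logit scale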
Specifically:
a) We are pooling internally validated c-statistics.
b) We are comparing different machine learning model types (e.g.,
Logistic Regression, Random Forest, XGBoost), typically applied to the
same sample within each study.
c) We are analyzing highly nested data, where predictor sets may
differ across models, even within the same study (e.g., some models
may include blood markers while others do not).
d) We are not comparing identical models across studies, and within
studies, model comparisons do not always use the same predictor sets.
First, would this sort of descriptive meta-analysis be sensible given
the complexity of the data? Second, is the current modelling approach
sensible, or should it in principle be made simpler or more complex
(e.g., by subsetting the outcomes or adding more levels of nesting)?
Thank you very much in advance,
Mika Manninen
***
Data:
dat.ml <- structure(list(cluster_id = c(1.1, 10.1, 11.1, 11.2, 12.1, 12.2,
12.3, 13.1, 13.1, 13.1, 14.1, 14.1, 14.1, 15.1, 16.1, 16.1, 16.1,
16.2, 16.2), row_id = c(1, 95, 96, 97, 98, 101, 104, 107, 108,
110, 113, 114, 116, 119, 120, 121, 122, 128, 129), study_id = c(1,
10, 11, 11, 12, 12, 12, 13, 13, 13, 13, 13, 13, 14, 15, 15, 15,
15, 15), sample_id = c(1, 10, 11, 11, 12, 12, 12, 13, 13, 13,
14, 14, 14, 15, 16, 16, 16, 16, 16), predictor_id = c(1, 1, 1,
2, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2), model_id = c(1,
1, 1, 2, 1, 1, 1, 1, 2, 4, 1, 2, 4, 1, 1, 2, 3, 1, 2), model_type =
c("Logistic Regression",
"Logistic Regression", "Logistic Regression", "XGBoost", "XGBoost",
"XGBoost", "XGBoost", "Random Forest", "Logistic Regression",
"XGBoost", "Random Forest", "Logistic Regression", "XGBoost",
"Logistic Regression", "Logistic Regression", "Random Forest",
"XGBoost", "Logistic Regression", "Random Forest"), yi = c(0.78,
0.94, 1.07, 1.09, 0.90,
0.63, 0.94, 0.10, -0.61,
0.20, 0.55, 0.28, 1.00,
1.11, 0.71, 1.13, 1.14,
1.62, 2.48), vi = c(0.02,
0.002, 0.073, 0.011, 0.00055,
0.00055, 0.056, 0.13,
0.15, 0.13, 0.12, 0.11,
0.15, 0.012, 0.0036, 0.005,
0.009, 0.02, 0.023)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -19L))
R-code:
library(metafor)

# 9.1 vcalc()
# construct correlation matrix for Logistic Regression, XGBoost, and
# Random Forest (fake correlations here)
A <- matrix(0.7, nrow=3, ncol=3)
A[3,1] <- A[1,3] <- 0.8
A[2,3] <- A[3,2] <- 0.9
diag(A) <- 1
rownames(A) <- colnames(A) <- c("Logistic Regression", "Random Forest", "XGBoost")
A
# construct the variance covariance matrix
V <- vcalc(vi = vi, cluster = cluster_id, type = model_type,
           data = dat.ml, rho = A)
V
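# optional sanity check (not part of the original code; just a sketch):
# inspect the implied correlation structure within each cluster
lapply(split(seq_len(nrow(dat.ml)), dat.ml$cluster_id),
       function(i) round(cov2cor(V[i, i, drop = FALSE]), 2))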
# Multivariate meta-analysis (perhaps this should instead be
# mods = ~ factor(model_type) - 1, random = ~ 1 | sample_id/predictor_id,
# data = dat.ml)
res.ml <- rma.mv(yi, V, mods = ~ factor(model_type) - 1,
                 random = ~ factor(model_type) | cluster_id, struct = "UN",
                 data = dat.ml, method = "REML")
res.ml
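For comparison, a sketch of the simpler nested specification mentioned in the
comment above (untested; same moderators, but random effects for samples and
predictor sets instead of an unstructured model-type structure):

res.ml2 <- rma.mv(yi, V, mods = ~ factor(model_type) - 1,
                  random = ~ 1 | sample_id/predictor_id,
                  data = dat.ml, method = "REML")
res.ml2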