[R-meta] Descriptive Multilevel Meta-analysis with C-Statistics

Mika Manninen mixu89 at gmail.com
Mon May 19 17:34:37 CEST 2025


Dear Wolfgang,

Thank you very much for your insightful reply; it is greatly appreciated.

The A <- A1 %x% A2 construction -- the Kronecker product of two
matrices -- is something that I did not think about at all.

Out of curiosity: we have reasonable estimates with which to construct
the matrices, but when the model involves the Kronecker product of two
matrices, are there any guidelines for conducting sensitivity analyses
-- in particular, how many alternative correlation structures should
be tested, and how would you think about this issue in general?
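For concreteness, here is the kind of grid I was imagining (the rho values below are made-up placeholders, not estimates from our data):

```r
# Sensitivity grid over the two correlation parameters used to build
# A <- A1 %x% A2 (the rho values below are placeholders, not estimates)
rho_model <- c(0.5, 0.7, 0.9)  # correlation among model types (A1)
rho_pred  <- c(0.7, 0.9)       # correlation among predictor sets (A2)
grid <- expand.grid(rho_model = rho_model, rho_pred = rho_pred)

make_A <- function(rm, rp) {
  A1 <- matrix(rm, nrow = 3, ncol = 3); diag(A1) <- 1
  A2 <- matrix(rp, nrow = 3, ncol = 3); diag(A2) <- 1
  A1 %x% A2  # Kronecker product -> 9x9 correlation matrix
}

# verify each candidate A is positive definite before passing it to vcalc()
grid$pd <- apply(grid, 1, function(g) {
  ev <- eigen(make_A(g[["rho_model"]], g[["rho_pred"]]), only.values = TRUE)$values
  min(ev) > 0
})
grid
```

For each positive-definite candidate A, one could then rebuild V with vcalc(), refit the rma.mv() model, and compare the robust() coefficients across the grid -- but I am unsure how many such candidates would be considered adequate.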

Kind regards,
Mika

On Mon, 19 May 2025 at 12:44, Viechtbauer, Wolfgang (NP)
(wolfgang.viechtbauer using maastrichtuniversity.nl) wrote:
>
> Dear Mika,
>
> I think this is quite sensible. Some things to consider:
>
> 1) I would not only consider 'model_type' as a predictor, but also what predictors were included in the models. I think this relates to your 'predictor_id' variable.
>
> 2) As has been discussed before on this list, there are two types of dependencies to consider. If the same data are analyzed using different methods/models, then the sampling errors of the corresponding yi values are not independent. I think your 'sample_id' variable indicates whether this is the case. This kind of dependency should ideally be reflected in the V matrix and one can also add random effects accordingly. However, even if two yi values are based on different samples (and hence their sampling errors are independent), if they come from the same study, then there may be dependency because results reported for the same study may be more alike than results reported from different studies. I think 'study_id' is the relevant variable here. This kind of dependency is accounted for via random effects.
>
> So in your construction of the V matrix, I would consider using:
>
> V <- vcalc(vi = vi, cluster = sample_id, type = model_type, data = dat.ml, rho = A)
>
> But this induces a correlation of 1 for yi values analyzed with the same model but different predictor_id values (which applies to sample_id values 12 and 16 hence you get a warning about that).
>
> So you could expand the A matrix to also consider the predictor_id values and your 'cluster' variable is then a combination of model_type and predictor_id. Something like:
>
> A1 <- matrix(0.7, nrow=3, ncol=3)
> A1[3,1] <- A1[1,3] <- 0.8
> A1[2,3] <- A1[3,2] <- 0.9
> diag(A1) <- 1
> rownames(A1) <- colnames(A1) <- c("Logistic Regression", "Random Forest", "XGBoost")
> A1
>
> A2 <- matrix(0.9, nrow=3, ncol=3)
> diag(A2) <- 1
> rownames(A2) <- colnames(A2) <- 1:3
>
> A <- A1 %x% A2
> rownames(A) <- colnames(A) <- sapply(rownames(A1), function(x) paste0(x, "_", rownames(A2)))
> A
>
> dat.ml$model_type_predictor_id <- paste0(dat.ml$model_type, "_", dat.ml$predictor_id)
>
> V <- vcalc(vi = vi, cluster = cluster_id, type = model_type_predictor_id,  data = dat.ml, rho = A)
> V
>
> Of course construction of A and hence V is guesswork, but we always have cluster-robust inference methods in the end to account for the fact that V (and hence our model) is just an approximation.
>
> As for the random effects structure, it depends a bit on how much data you have. A start may be:
>
> res.ml <- rma.mv(yi, V, mods = ~ factor(model_type_predictor_id)-1,
>                  random = ~ 1 | study_id/sample_id/model_type/predictor_id, data = dat.ml)
> res.ml
>
> This implicitly assumes a constant correlation for sample_id within study_id (which is sensible), a different but constant correlation for model_type within sample_id (which is a bit less sensible), and a different but constant correlation for predictor_id within model_type (again a bit overly simplistic). To relax the constant correlation part, one would have to go to 'inner ~ outer' random effects, but the number of parameters quickly increases then.
>
> In the end, you want to do:
>
> robust(res.ml, cluster=study_id, clubSandwich=TRUE)
>
> anyway.
>
> Best,
> Wolfgang
>
> > -----Original Message-----
> > From: R-sig-meta-analysis <r-sig-meta-analysis-bounces using r-project.org> On Behalf
> > Of Mika Manninen via R-sig-meta-analysis
> > Sent: Monday, May 19, 2025 12:32
> > To: r-sig-meta-analysis using r-project.org
> > Cc: Mika Manninen <mixu89 using gmail.com>
> > Subject: [R-meta] Descriptive Multilevel Meta-analysis with C-Statistics
> >
> > Dear meta-analysis community
> >
> > We aim to explore whether, on average across various settings and
> > samples, Random Forests, XGBoost, and Logistic Regression differ in
> > their ability to discriminate the same medical outcome. In our example
> > dataset, the c-statistics (AUCs) have been logit-transformed (yi) and
> > are accompanied by their variances (vi).
> >
> > Specifically:
> >
> > a) We are pooling internally validated c-statistics.
> >
> > b) We are comparing different machine learning model types (e.g.,
> > Logistic Regression, Random Forest, XGBoost), typically applied to the
> > same sample within each study.
> >
> > c) We are analyzing highly nested data, where predictor sets may
> > differ across models, even within the same study (e.g., some models
> > may include blood markers while others do not).
> >
> > d) We are not comparing identical models across studies, and within
> > studies, model comparisons do not always use the same predictor sets.
> >
> > First of all, would this sort of descriptive meta-analysis be
> > sensible given the complexity of the data? Secondly, is the current
> > modelling approach sensible, or should it in principle be made
> > simpler or more complex (e.g., by subsetting the outcomes or adding
> > more levels of nesting)?
> >
> > Thank you very much in advance,
> > Mika Manninen
> >
> > ***
> >
> > Data:
> >
> > structure(list(cluster_id = c(1.1, 10.1, 11.1, 11.2, 12.1, 12.2,
> > 12.3, 13.1, 13.1, 13.1, 14.1, 14.1, 14.1, 15.1, 16.1, 16.1, 16.1,
> > 16.2, 16.2), row_id = c(1, 95, 96, 97, 98, 101, 104, 107, 108,
> > 110, 113, 114, 116, 119, 120, 121, 122, 128, 129), study_id = c(1,
> > 10, 11, 11, 12, 12, 12, 13, 13, 13, 13, 13, 13, 14, 15, 15, 15,
> > 15, 15), sample_id = c(1, 10, 11, 11, 12, 12, 12, 13, 13, 13,
> > 14, 14, 14, 15, 16, 16, 16, 16, 16), predictor_id = c(1, 1, 1,
> > 2, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2), model_id = c(1,
> > 1, 1, 2, 1, 1, 1, 1, 2, 4, 1, 2, 4, 1, 1, 2, 3, 1, 2), model_type =
> > c("Logistic Regression",
> > "Logistic Regression", "Logistic Regression", "XGBoost", "XGBoost",
> > "XGBoost", "XGBoost", "Random Forest", "Logistic Regression",
> > "XGBoost", "Random Forest", "Logistic Regression", "XGBoost",
> > "Logistic Regression", "Logistic Regression", "Random Forest",
> > "XGBoost", "Logistic Regression", "Random Forest"), yi = c(0.78,
> > 0.94, 1.07, 1.09, 0.90,
> > 0.63, 0.94, 0.10, -0.61,
> > 0.20, 0.55, 0.28, 1.00,
> > 1.11, 0.71, 1.13, 1.14,
> > 1.62, 2.48), vi = c(0.02,
> > 0.002, 0.073, 0.011, 0.00055,
> > 0.00055, 0.056, 0.13,
> > 0.15, 0.13, 0.12, 0.11,
> > 0.15, 0.012, 0.0036, 0.005,
> > 0.009, 0.02, 0.023)), class = c("tbl_df",
> > "tbl", "data.frame"), row.names = c(NA, -19L))
> >
> > R-code:
> >
> > # 9.1 Vcalc
> > # construct correlation matrix for logistic regression, XGBoost, and
> > # random forest (fake correlations here)
> >
> > A <- matrix(0.7, nrow=3, ncol=3)
> > A[3,1] <- A[1,3] <- 0.8
> > A[2,3] <- A[3,2] <- 0.9
> > diag(A) <- 1
> > rownames(A) <- colnames(A) <- c("Logistic Regression", "Random Forest", "XGBoost")
> > A
> >
> > # construct the variance covariance matrix
> > V <- vcalc(vi = vi, cluster = cluster_id, type = model_type, data = dat.ml, rho = A)
> > V
> >
> > # Multivariate meta-analysis (perhaps should be: mods = ~ factor(model_type) - 1,
> > # random = ~ 1 | sample_id/predictor_id, data = dat.ml)
> >
> > res.ml <- rma.mv(yi, V, mods = ~ factor(model_type) - 1,
> >                  random = ~ factor(model_type) | cluster_id,
> >                  struct = "UN", data = dat.ml, method = "REML")
> > res.ml
