[R-meta] Performance of metafor::vcalc() vs clubSandwich::impute_covariance_matrix()

James Pustejovsky <jepusto using gmail.com>
Tue Aug 6 20:50:08 CEST 2024


Hi Tamar,

Thanks for the reprex. Wow, this surprised me. It looks like the biggest
source of compute time in vcalc() is assignments to the sparse matrix
representation of V. There might be a way to improve efficiency there, but
I will have to confer with Wolfgang.
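
To make the point about sparse assignment concrete, here is a small sketch
(not vcalc()'s actual code, and not run on your data). It builds the same
block-diagonal V two ways with the Matrix package: once with a single
bdiag() call and once by assigning each study's block into an initially
empty sparse matrix. The cluster sizes, the variances, and the helper
make_block() are made up for illustration; rho = 0.6 matches your example.

```r
library(Matrix)

set.seed(1)
k       <- sample(1:10, size = 150, replace = TRUE) # effect sizes per study (made up)
rho     <- 0.6
vi      <- runif(sum(k), 0.01, 0.1)                 # made-up sampling variances
cluster <- rep(seq_along(k), times = k)

# Number of stored entries: block-diagonal vs. full matrix
c(block_diagonal = sum(k^2), full = sum(k)^2)

# Hypothetical helper: one block per study, with vi on the diagonal and
# rho * sqrt(vi_i * vi_j) off the diagonal
make_block <- function(v) {
  s <- sqrt(v)
  B <- rho * tcrossprod(s)
  diag(B) <- v
  B
}
blocks <- lapply(split(vi, cluster), make_block)

# (a) assemble all blocks in a single call
t_bdiag <- system.time(V1 <- bdiag(blocks))

# (b) fill an empty sparse matrix one block at a time
t_assign <- system.time({
  n  <- sum(k)
  V2 <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                     dims = c(n, n))
  end   <- cumsum(k)
  start <- end - k + 1
  for (j in seq_along(k)) {
    idx <- start[j]:end[j]
    V2[idx, idx] <- as.vector(blocks[[j]])
  }
})

# Both routes should give the same matrix
all.equal(as.matrix(V1), as.matrix(V2), check.attributes = FALSE)

# Compare timings
rbind(bdiag = t_bdiag, assign = t_assign)
```

If block-wise assignment is indeed the bottleneck, I would expect the loop
in (b) to be the slower of the two, but the timings will depend on the
machine and on the number of studies.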

If you are able to reproduce the other issue with getting different results
between vcalc() and impute_covariance_matrix(), please do share.

Just to provide a little bit of additional context, I am deprecating
impute_covariance_matrix() because it is an outlier in the clubSandwich
package. It is essentially the only function in the package that is
specific to meta-analysis workflows. All other package functionality
involves computing cluster-robust variance estimators and associated
quantities (hypothesis tests, confidence intervals, etc.). In terms of
conceptual organization, it makes more sense for a function like
impute_covariance_matrix() to live in metafor. However,
impute_covariance_matrix() is not going to disappear entirely from
clubSandwich, so you could certainly continue using it in the short term
for purposes of computational efficiency.

James

On Tue, Aug 6, 2024 at 10:19 AM Tamar Novetsky <tamar using growprogress.ai>
wrote:

> Thanks so much, James! Unfortunately, I didn't find a big enough
> improvement in performance using vcalc(sparse = TRUE) - in the example
> below, the default vcalc arguments take ~100x longer than
> impute_covariance_matrix, while vcalc(sparse = TRUE) takes ~60x longer.
>
> I couldn't reproduce the doubled values using non-proprietary data, so
> there might just be something weird going on with my dataset!
>
> Reproducible example (adapted from metafor's examples in the vcalc
> function documentation):
> ```
> library(tidyverse)
> library(metafor)
> library(clubSandwich)
> library(microbenchmark)
> set.seed(42)
>
> # example data from metafor
> dat <- dat.assink2016
>
> # augment data so it has >1500 rows
> new_rows <-
>   tibble(
>     study = 18:167,
>     n_esid = sample(x = 1:max(dat$esid), size = 150, replace = TRUE)
>   ) %>%
>   uncount(n_esid) %>%
>   group_by(study) %>%
>   mutate(esid = row_number()) %>%
>   ungroup() %>%
>   mutate(
>     id = row_number() + 100,
>     yi = rnorm(nrow(.), mean(dat$yi), sd(dat$yi)),
>     vi = rnorm(nrow(.), mean(dat$vi), sd(dat$vi)),
>     vi = if_else(vi < 0, -1*vi, vi), # make sure vi is always positive
>     pubstatus = sample(x = dat$pubstatus, size = nrow(.), replace = TRUE),
>     year = sample(x = dat$year, size = nrow(.), replace = TRUE),
>     deltype = sample(x = dat$deltype, size = nrow(.), replace = TRUE)
>   )
> dat_big <- bind_rows(dat, new_rows)
>
> # benchmark performance with full matrix (this takes a minute to run)
> res <- microbenchmark(
>   "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big,
>                     rho = 0.6),
>   "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi,
>     cluster = dat_big$study, r = 0.6, return_list = FALSE),
>   times = 10
> )
> summary(res)
>
> # benchmark performance with sparse matrix (also takes a minute to run)
> res_sparse <- microbenchmark(
>   "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big,
>                     rho = 0.6, sparse = TRUE),
>   "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi,
>     cluster = dat_big$study, r = 0.6, return_list = FALSE),
>   times = 10
> )
> summary(res_sparse)
> ```
>
> Thanks again,
>
>
> *Tamar Novetsky* *(she/her)*
> Data Scientist I
> Eastern Time Zone
>
>
> On Tue, Aug 6, 2024 at 10:20 AM James Pustejovsky <jepusto using gmail.com>
> wrote:
>
>> Hi Tamar,
>>
>> The difference in compute time is because of a difference in how the
>> default output of these functions is structured.
>> clubSandwich::impute_covariance_matrix() returns a block-diagonal matrix by
>> default. metafor::vcalc() returns a full (dense) matrix by default. Say
>> that you have J studies and study j has kj effect sizes. The block-diagonal
>> matrix has sum(kj^2) entries, whereas the full matrix has sum(kj)^2
>> entries. If J is large and the kjs are mostly small, this can make for a
>> really big difference in object size. However, setting the option
>> vcalc(sparse = TRUE) will return a block-diagonal matrix and should lead to
>> performance comparable to impute_covariance_matrix().
>>
>> Regarding your second question, I'm not sure what might be going on.
>> Could you provide a reproducible example?
>>
>> James
>>
>> On Tue, Aug 6, 2024 at 8:20 AM Tamar Novetsky via R-sig-meta-analysis <
>> r-sig-meta-analysis using r-project.org> wrote:
>>
>>> Hello,
>>>
>>> I am working on a script to run multiple meta-regressions on different
>>> subsets of the same dataset, and have been
>>> using clubSandwich::impute_covariance_matrix() to generate the
>>> variance-covariance matrix necessary as an input to metafor::rma.mv().
>>> However, I recently learned that impute_covariance_matrix() has been
>>> superseded by metafor::vcalc(), so I have been working to replace my usage
>>> of the former function with the latter. In that process, I discovered that
>>> vcalc() seems to be much slower than impute_covariance_matrix() - about
>>> 150x slower in one use case that I benchmarked using the microbenchmark
>>> package. Since I will be running this many times in a loop, performance
>>> matters quite a lot to me in this context.
>>>
>>> Can anyone help me understand why vcalc() would be so much slower? Is it
>>> possible that I'm using it incorrectly?
>>>
>>> Secondly/possibly relatedly, I found that the results from vcalc() are
>>> always either exactly the same or exactly double the results from
>>> impute_covariance_matrix(). Does anyone have a sense of why that would be?
>>> Could that be related to the performance differences?
>>>
>>> Thanks so much for your help,
>>>
>>>
>>> *Tamar Novetsky* *(she/her)*
>>> Data Scientist I
>>> Eastern Time Zone
>>>
>>> _______________________________________________
>>> R-sig-meta-analysis mailing list @ R-sig-meta-analysis using r-project.org
>>> To manage your subscription to this mailing list, go to:
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis
>>>
>>