[R-meta] Performance of metafor::vcalc() vs clubSandwich::impute_covariance_matrix()
Viechtbauer, Wolfgang (NP)
wolfgang.viechtbauer at maastrichtuniversity.nl
Tue Aug 13 11:45:15 CEST 2024
Hi all,
Chiming in here after a while (coming back to the office after my summer break).
I didn't pay much attention to speed issues when writing vcalc(), since model fitting with large datasets is going to be the bottleneck, not the construction of the V matrix (whether constructing V takes 0.1 or 1 second hardly matters when model fitting can take minutes). Also, due to the additional functionality vcalc() provides, some of the internal computations are going to be slower. Not sure if there is a way to use Map() (as cleverly used in impute_covariance_matrix()) to avoid the double-loop I use in vcalc() -- pull requests welcome!
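Just to sketch the Map() idea for the simple case considered here (a single rho and blocks built per cluster) -- this is only a rough illustration, not the actual internals of impute_covariance_matrix(), and note that split() orders the blocks by the sorted cluster levels, so the rows line up with the data only when the data are sorted by cluster:

vi_split  <- split(dat_big$vi, dat_big$study)  # sampling variances per cluster
rho_split <- rep_len(0.6, length(vi_split))    # recycle rho (could be cluster-specific)
blocks <- Map(function(v, r) {
   n <- length(v)
   R <- matrix(r, n, n)      # constant correlation matrix for this cluster
   diag(R) <- 1
   S <- diag(sqrt(v), n, n)  # diagonal matrix of standard errors
   S %*% R %*% S
}, vi_split, rho_split)
V <- bldiag(blocks)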
However, for the particular use case given here, one can speed things up further by distilling things down to the mere essence. For example:
vfast <- function(vi, study, rho) {
   ids <- unique(study)
   out <- lapply(ids, function(id) {
      sel <- study == id              # rows belonging to this study
      n <- sum(sel)
      R <- matrix(rho, n, n)          # constant correlation matrix
      diag(R) <- 1
      S <- diag(sqrt(vi[sel]), n, n)  # diagonal matrix of standard errors
      S %*% R %*% S                   # covariance block for this study
   })
   bldiag(out)                        # assemble the block-diagonal V matrix
}
all.equal(vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
          vfast(dat_big$vi, dat_big$study, rho = 0.6))
# benchmark performance with full matrix
res <- microbenchmark(
   "metafor"      = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
   "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster = dat_big$study, r = 0.6, return_list = FALSE),
   "vfast"        = vfast(dat_big$vi, dat_big$study, rho = 0.6),
   times = 10
)
summary(res)
This is now around 5 times faster than impute_covariance_matrix() (about 0.02 seconds on average) and I assume one could speed this up even further.
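For instance, one possibility (just a sketch on my part, which I have not verified carefully) would be to skip the explicit R matrix and build each block directly from an outer product of sqrt(vi), optionally returning a sparse block-diagonal matrix via Matrix::bdiag():

vfaster <- function(vi, study, rho, sparse = FALSE) {
   out <- lapply(unique(study), function(id) {
      v <- vi[study == id]
      V <- rho * tcrossprod(sqrt(v))  # rho * sqrt(v_i * v_j) off the diagonal
      diag(V) <- v                    # sampling variances on the diagonal
      V
   })
   if (sparse) Matrix::bdiag(out) else bldiag(out)
}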
@James: The deprecation message (Please use `metaffor::vcalc()` instead.) has a typo (should be metafor::vcalc()).
Best,
Wolfgang
> -----Original Message-----
> From: R-sig-meta-analysis <r-sig-meta-analysis-bounces using r-project.org> On Behalf
> Of James Pustejovsky via R-sig-meta-analysis
> Sent: Tuesday, August 6, 2024 20:50
> To: Tamar Novetsky <tamar using growprogress.ai>
> Cc: James Pustejovsky <jepusto using gmail.com>; R Special Interest Group for Meta-
> Analysis <r-sig-meta-analysis using r-project.org>
> Subject: Re: [R-meta] Performance of metafor::vcalc() vs
> clubSandwich::impute_covariance_matrix()
>
> Hi Tamar,
>
> Thanks for the reprex. Wow, this surprised me. It looks like the biggest
> source of compute time in vcalc() is the assignments to the sparse matrix
> representation of V. There might be a way to improve efficiency there, but I
> will have to confer with Wolfgang.
>
> If you are able to reproduce the other issue with getting different results
> between vcalc() and impute_covariance_matrix(), please do share.
>
> Just to provide a little bit of additional context, I am deprecating
> impute_covariance_matrix() because it is an outlier in the clubSandwich
> package. It is essentially the only function in the package that is
> specific to meta-analysis workflows. All other package functionality
> involves computing cluster-robust variance estimators and associated
> quantities (hypothesis tests, confidence intervals, etc.). In terms of
> conceptual organization, it makes more sense for a function like
> impute_covariance_matrix() to live in metafor. However,
> impute_covariance_matrix() is not going to disappear entirely from
> clubSandwich, so you could certainly continue using it in the short term
> for purposes of computational efficiency.
>
> James
>
> On Tue, Aug 6, 2024 at 10:19 AM Tamar Novetsky <tamar using growprogress.ai>
> wrote:
>
> > Thanks so much, James! Unfortunately, I didn't find a big enough
> > improvement in performance using vcalc(sparse = TRUE) - in the example
> > below, the default vcalc arguments take ~100x longer than
> > impute_covariance_matrix, while vcalc(sparse = TRUE) takes ~60x longer.
> >
> > I couldn't reproduce the 2x values using non-proprietary data, so there
> > might just be something weird going on with my dataset!
> >
> > Reproducible example (adapted from metafor's examples in the vcalc
> > function documentation):
> > ```
> > library(tidyverse)
> > library(metafor)
> > library(clubSandwich)
> > library(microbenchmark)
> > set.seed(42)
> >
> > # example data from metafor
> > dat <- dat.assink2016
> >
> > # augment data so it has >1500 rows
> > new_rows <-
> >   tibble(
> >     study = 18:167,
> >     n_esid = sample(x = 1:max(dat$esid), size = 150, replace = TRUE)
> >   ) %>%
> >   uncount(n_esid) %>%
> >   group_by(study) %>%
> >   mutate(esid = row_number()) %>%
> >   ungroup() %>%
> >   mutate(
> >     id = row_number() + 100,
> >     yi = rnorm(nrow(.), mean(dat$yi), sd(dat$yi)),
> >     vi = rnorm(nrow(.), mean(dat$vi), sd(dat$vi)),
> >     vi = if_else(vi < 0, -1*vi, vi), # make sure vi is always positive
> >     pubstatus = sample(x = dat$pubstatus, size = nrow(.), replace = TRUE),
> >     year = sample(x = dat$year, size = nrow(.), replace = TRUE),
> >     deltype = sample(x = dat$deltype, size = nrow(.), replace = TRUE)
> >   )
> > dat_big <- bind_rows(dat, new_rows)
> >
> > # benchmark performance with full matrix (this takes a minute to run)
> > res <- microbenchmark(
> >   "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
> >   "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster = dat_big$study, r = 0.6, return_list = FALSE),
> >   times = 10
> > )
> > summary(res)
> >
> > # benchmark performance with sparse matrix (also takes a minute to run)
> > res_sparse <- microbenchmark(
> >   "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6, sparse = TRUE),
> >   "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster = dat_big$study, r = 0.6, return_list = FALSE),
> >   times = 10
> > )
> > summary(res_sparse)
> > ```
> >
> > Thanks again,
> >
> >
> > *Tamar Novetsky* *(she/her)*
> > Data Scientist I
> > Eastern Time Zone
> >
> >
> > On Tue, Aug 6, 2024 at 10:20 AM James Pustejovsky <jepusto using gmail.com>
> > wrote:
> >
> >> Hi Tamar,
> >>
> >> The difference in compute time is because of a difference in how the
> >> default output of these functions is structured.
> >> clubSandwich::impute_covariance_matrix() returns a block-diagonal matrix
> >> by default. metafor::vcalc() returns a full (dense) matrix by default. Say
> >> that you have J studies and study j has kj effect sizes. The block-diagonal
> >> matrix has sum(kj^2) entries, whereas the full matrix has sum(kj)^2
> >> entries. If J is large and the kjs are mostly small, this can make for a
> >> really big difference in object size. However, setting the option
> >> vcalc(sparse = TRUE) will return a block-diagonal matrix and should lead to
> >> performance comparable to impute_covariance_matrix().
> >>
> >> Regarding your second question, I'm not sure what might be going on.
> >> Could you provide a reproducible example?
> >>
> >> James
> >>
> >> On Tue, Aug 6, 2024 at 8:20 AM Tamar Novetsky via R-sig-meta-analysis <
> >> r-sig-meta-analysis using r-project.org> wrote:
> >>
> >>> Hello,
> >>>
> >>> I am working on a script to run multiple meta-regressions on different
> >>> subsets of the same dataset, and have been
> >>> using clubSandwich::impute_covariance_matrix() to generate the
> >>> variance-covariance matrix necessary as an input to metafor::rma.mv().
> >>> However, I recently learned that impute_covariance_matrix() has been
> >>> superseded by metafor::vcalc(), so I have been working to replace my
> >>> usage
> >>> of the former function with the latter. In that process, I discovered
> >>> that
> >>> vcalc() seems to be much slower than impute_covariance_matrix() - about
> >>> 150x slower in one use case that I benchmarked using the microbenchmark
> >>> package. Since I will be running this many times in a loop, performance
> >>> matters quite a lot to me in this context.
> >>>
> >>> Can anyone help me understand why vcalc() would be so much slower? Is it
> >>> possible that I'm using it incorrectly?
> >>>
> >>> Secondly/possibly relatedly, I found that the results from vcalc() are
> >>> always either exactly the same or exactly double the results from
> >>> impute_covariance_matrix(). Does anyone have a sense of why that would
> >>> be?
> >>> Could that be related to the performance differences?
> >>>
> >>> Thanks so much for your help,
> >>>
> >>> *Tamar Novetsky* *(she/her)*
> >>> Data Scientist I
> >>> Eastern Time Zone