[R-meta] Influential case diagnostics in a multivariate multilevel meta-analysis in metafor

Viechtbauer, Wolfgang (SP) wolfgang.viechtbauer using maastrichtuniversity.nl
Thu Jan 17 23:16:45 CET 2019


Hi Yogev,

Just to be safe, make sure you are using the latest 'devel' version of metafor; running devtools::install_github("wviechtb/metafor") will take care of that. Also, I would go with whatever detectCores(logical=FALSE) tells you for the number of cores. But even without that, things should finish in a few minutes. Beyond that, I really don't know what the issue could be. It certainly isn't an issue with metafor per se.
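
In case it is useful, here are those two steps in one place (just a quick sketch, assuming the 'devtools' package is installed):

devtools::install_github("wviechtb/metafor")   # current 'devel' version of metafor
library(parallel)
detectCores(logical=FALSE)                      # use this value for 'ncpus'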

Best,
Wolfgang

-----Original Message-----
From: Yogev Kivity [mailto:yogev_k using yahoo.com] 
Sent: Thursday, 17 January, 2019 21:37
To: Viechtbauer, Wolfgang (SP)
Cc: Martineau, Roger (AAFC/AAC); R-sig-meta-analysis using r-project.org
Subject: Re: [R-meta] Influential case diagnostics in a multivariate multilevel meta-analysis in metafor

Hi Wolfgang,

Thanks for your detailed reply and suggestions. Unfortunately, even after implementing them, I could not get the computation to terminate after letting it run overnight (with 4 logical cores).

I was going to suggest that perhaps the issue has to do with my dataset being more unbalanced than the konstantopoulos2011 data (cluster sizes in my dataset range from 1 to 234 effect sizes, with a mean of 11 and a median of 5). However, when I ran the konstantopoulos2011 code, I got similar times for fitting the models (using standard BLAS), but I could not get the Cook’s distances computation to terminate even after 2050 seconds – even when I used parallel processing with 4 logical cores. I used this code:

system.time(sav2 <- cooks.distance(res2, cluster=dat$group, reestimate=FALSE, parallel="snow", ncpus=4))

Any thoughts?

Thanks,
Yogev
--
Yogev Kivity, Ph.D. 
Postdoctoral Fellow 
Department of Psychology 
The Pennsylvania State University 
Bruce V. Moore Building 
University Park, PA 16802 
Office Phone: (814) 867-2330

On Thu, Jan 17, 2019 at 4:24 AM Viechtbauer, Wolfgang (SP) <wolfgang.viechtbauer using maastrichtuniversity.nl> wrote:
Please keep the mailing list in cc.

I don't know what model you are fitting, but with k=820, that running time seems excessive. Here is an artificial example with k=2800. I just use the data from 'dat.konstantopoulos2011' and replicate them 50 times to create a much larger dataset. I then fit a multilevel model with group (replication), district, and school as random effects. First, I use the defaults and then sparse=TRUE, since that should help quite a bit here. Also, I run things once with the standard BLAS routines and once with OpenBLAS (switching those routines requires making system changes, not something that can be done within R).

###########################

library(metafor)

dat <- dat.konstantopoulos2011
group <- rep(1:nrow(dat), each=50)
dat <- dat[group,]
dat$group <- group
rm(group)
nrow(dat)

system.time(res1 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat))

system.time(res2 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat, sparse=TRUE))

system.time(sav1 <- cooks.distance(res2, cluster=dat$group, reestimate=FALSE))

###### results:

### with standard BLAS

> system.time(res1 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat))
   user  system elapsed 
683.587   8.712 692.312 
> 
> system.time(res2 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat, sparse=TRUE))
   user  system elapsed 
  8.292   0.600   8.894 

> system.time(sav <- cooks.distance(res2, cluster=dat$group, reestimate=FALSE))
   user  system elapsed 
270.960   0.044 271.005 

### with OpenBLAS

> system.time(res1 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat))
   user  system elapsed 
 86.531   8.707  95.242 
> 
> system.time(res2 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat, sparse=TRUE))
   user  system elapsed 
  6.476   0.632   7.108

> system.time(sav1 <- cooks.distance(res2, cluster=dat$group, reestimate=FALSE))
   user  system elapsed 
148.071   0.060 148.133

###########################

So, with the defaults and standard BLAS, fitting that model takes 11.5 minutes, which is a bit painful (esp. if you then also want to compute the Cook's distances). Using sparse=TRUE brings this down to 9 seconds. Computing the 'group' level Cook's distances (using reestimate=FALSE, so really they are approximations, but usually good enough for diagnostic purposes) takes 4.5 minutes, which does require you to grab a cup of coffee and have a quick chat with a colleague at the coffee machine, but that isn't such a bad thing.

Switching to OpenBLAS helps esp. when using the defaults (now about 1.5 minutes). Using sparse=TRUE brings the time down to 7 seconds and the Cook's distances are then computed in about 2.5 minutes. That only leaves time to grab coffee and say hi to your colleague.
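
As an aside, if you are not sure which BLAS your R installation is linked against, sessionInfo() will tell you (just a quick check, nothing metafor-specific; on R >= 3.4 the relevant lines are labeled 'BLAS' and 'LAPACK'):

sessionInfo()   # the 'BLAS:' and 'LAPACK:' lines show which libraries R is using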

I did not use any multicore processing here, so if you use 2 cores, you can pretty much halve the time to compute the Cook's distances (there is a bit of overhead when using multicore processing, but that should be minor here).
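
For example, with the objects from above, the parallel version would look something like this (just a sketch; 'sav3' is a new object name and ncpus=2 assumes two physical cores):

system.time(sav3 <- cooks.distance(res2, cluster=dat$group, reestimate=FALSE,
                                   parallel="snow", ncpus=2))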

So, while rma.mv() isn't super fast, I am wondering why your (and Yogev's) running times are so long.

Best,
Wolfgang

-----Original Message-----
From: Martineau, Roger (AAFC/AAC) [mailto:roger.martineau using canada.ca] 
Sent: Wednesday, 16 January, 2019 19:21
To: Viechtbauer, Wolfgang (SP)
Subject: [R-meta] Influential case diagnostics in a multivariate multilevel meta-analysis in metafor

Dear Wolfgang,

I have exactly the same problem as Dr. Kivity and have not been able to solve it yet, presumably due to the size of the data set (n = 820). I have to let the Cook’s distance computation run overnight and it is a real pain.

I checked the number of cores available (see below). Are they sufficient?

> library(nat.utils)
> ncpus()
[1] 4

> library(parallel)
> detectCores(logical=FALSE)
[1] 2

This is one very frustrating issue with rma.mv, because I can fit a multilevel model with the lmer function (I know rma.mv is more appropriate in a meta-analytic context) and get Cook’s distance values a lot faster with the following:

> library(influence.ME)
> infl <- influence(NoMods, obs = TRUE)   # NoMods is the lmer fit here
> plot(infl, which = "cook")
> tmp.cook <- cooks.distance(infl)
> which(tmp.cook > 0.5)
[1] 642

Admittedly, the Cook’s distance values are not exactly the same for rma.mv and lmer, but large values should be detected by both functions.
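
For what it is worth, a rough way to check that agreement could look like this (just a sketch; 'NoMods.rma' and 'NoMods.lmer' are placeholders for the rma.mv and lmer fits of the same model, and 0.5 is the cutoff used above):

cd.rma  <- cooks.distance(NoMods.rma, reestimate=FALSE)       # per effect size (metafor)
cd.lmer <- cooks.distance(influence(NoMods.lmer, obs=TRUE))   # per observation (influence.ME)
which(cd.rma > 0.5)
which(cd.lmer > 0.5)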

Best regards,

Roger ☺

Please note my new email address below

Roger Martineau, mv Ph.D.
Nutrition et Métabolisme des ruminants
Centre de recherche et de développement
sur le bovin laitier et le porc
Agriculture et agroalimentaire Canada/Agriculture and Agri-Food Canada
Téléphone/Telephone: 819-780-7319
Télécopieur/Facsimile: 819-564-5507
2000, Rue Collège / 2000, College Street
Sherbrooke (Québec)  J1M 0C8
Canada
roger.martineau using canada.ca
 
Dear Yogev,

Since you use 'cluster=StudyID', cooks.distance() is doing 311 model fits. But you use 'reestimate=FALSE', which should speed things up a lot. Also, 'sparse=TRUE' probably makes a lot of sense here, since the marginal var-cov structure is probably quite sparse. So, for the most part, you are already using features that should help to speed things up.

But a few things:

1) You used 'cluster = StudyID', but unless you used attach(Data) or have 'StudyID' as a separate object in your workspace, this should not work. It should be 'cluster = Data$StudyID'.

2) If you use 'parallel="snow"', then no progress bar will be shown, so I wonder how you got to the '6%'. Or did you run this once without 'parallel="snow"'?

3) If you use 'parallel="snow"', then this won't give you any speed increase unless you actually make use of multiple cores. You can do this with the 'ncpus' argument. But first check how many cores you actually have available with parallel::detectCores(). Note that this also counts 'logical' cores. If you are on MacOS or Windows, then detectCores(logical=FALSE) is a better indicator of how many cores to specify under 'ncpus' (see the sketch below).
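
Putting 1) to 3) together, the call would look something like this (just a sketch; ncpus=2 is a placeholder for whatever detectCores(logical=FALSE) reports on your machine):

NoModsCooksDistance <- cooks.distance(NoMods, cluster=Data$StudyID,
                                      reestimate=FALSE, parallel="snow", ncpus=2)
plot(NoModsCooksDistance, type="o", pch=19)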

Best,
Wolfgang

>-----Original Message-----
>From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces using r-
>project.org] On Behalf Of Yogev Kivity
>Sent: Tuesday, 15 January, 2019 21:20
>To: r-sig-meta-analysis using r-project.org
>Subject: [R-meta] Influential case diagnostics in a multivariate
>multilevel meta-analysis in metafor
>
>Hi all,
>
>I am fitting a multivariate multilevel meta-analysis in metafor and
>having
>trouble computing outlier and influential case diagnostics (i.e., Cook’s
>distances per
>https://wviechtb.github.io/metafor/reference/influence.rma.mv.html).
>
>This is a large dataset of 3360 Pearson’s correlations (converted to
>Fisher’s
>z) nested within 600 subsamples that are nested within 311 studies. Below
>is the code I used for the model and for computing Cook’s distances, and
>the problem is that it takes a lot of time to run (I ran it overnight
>and it only reached 6%). I am assuming it is related to the size of the
>dataset and to the complex model structure, but I am not sure how to go
>about and speed up the processing. I should note that I am computing the
>distances based on the simplest possible model (i.e., no moderators and
>without considering dependencies among effect sizes within clusters).
>
>I was hoping someone could help with some suggestions of how best to move
>forward.
>
>Thanks,
>Yogev
>
>NoMods <- rma.mv(yi, vi, random = ~ 1 | StudyID/GroupID/EffectSizeID,
>                 data=Data, sparse=TRUE)
>summary(NoMods)
>NoModsCooksDistance <- cooks.distance(NoMods, progbar=TRUE, cluster=StudyID,
>                                      reestimate=FALSE, parallel="snow")
>NoModsCooksDistance
>plot(NoModsCooksDistance, type="o", pch=19)
>
>--
>
>Yogev Kivity, Ph.D.
>Postdoctoral Fellow
>Department of Psychology
>The Pennsylvania State University
>Bruce V. Moore Building
>University Park, PA 16802
>Office Phone: (814) 867-2330

