[R-meta] Influential case diagnostics in a multivariate multilevel meta-analysis in metafor
Viechtbauer, Wolfgang (SP)
wo||g@ng@v|echtb@uer @end|ng |rom m@@@tr|chtun|ver@|ty@n|
Thu Jan 17 10:24:27 CET 2019
Please keep the mailing list in cc.
I don't know what model you are fitting, but with k=820, that running time seems excessive. Here is an artificial example with k=2800. I just use the data from 'dat.konstantopoulos2011' and replicate them 50 times to create a much larger dataset. I then fit a multilevel model with group (replication), district, and school as random effects. First, I use the defaults and then sparse=TRUE, since that should help quite a bit here. Also, I once run things with the standard BLAS routines and once with OpenBLAS (switching those routines requires making system changes, not something that can be done within R).
###########################
library(metafor)
dat <- dat.konstantopoulos2011
group <- rep(1:nrow(dat), each=50)
dat <- dat[group,]
dat$group <- group
rm(group)
nrow(dat)
system.time(res1 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat))
system.time(res2 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat, sparse=TRUE))
system.time(sav1 <- cooks.distance(res2, cluster=dat$group, reestimate=FALSE))
###### results:
### with standard BLAS
> system.time(res1 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat))
user system elapsed
683.587 8.712 692.312
>
> system.time(res2 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat, sparse=TRUE))
user system elapsed
8.292 0.600 8.894
> system.time(sav <- cooks.distance(res2, cluster=dat$group, reestimate=FALSE))
user system elapsed
270.960 0.044 271.005
### with OpenBLAS
> system.time(res1 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat))
user system elapsed
86.531 8.707 95.242
>
> system.time(res2 <- rma.mv(yi, vi, random = ~ 1 | group/district/school, data=dat, sparse=TRUE))
user system elapsed
6.476 0.632 7.108
> system.time(sav1 <- cooks.distance(res2, cluster=dat$group, reestimate=FALSE))
user system elapsed
148.071 0.060 148.133
###########################
So, with the defaults and standard BLAS, fitting that model takes 11.5 minutes, which is a bit painful (esp. if you then would compute the Cook's distances). Using sparse=TRUE brings this down to 9 seconds. Computing the 'group' level Cook's distances (using reestimate=FALSE, so really they are approximations, but usually good enough for diagnostic purposes) takes 4.5 minutes, which does require you to grab a cup of coffee and have a quick chat with a colleague at the coffee machine, but that isn't such a bad thing.
Switching to OpenBLAS helps esp. when using the defaults (now about 1.5 minutes). Using sparse=TRUE brings the time down to 7 seconds and the Cook's distances are then computed in about 2.5 minutes. That only leaves time to grab coffee and say hi to your colleague.
I did not use any multicore processing here, so if you use 2 cores, you can pretty much half the time to compute the Cook's distances (there is a bit of overhead when using multicore processing, but that should be minor here).
So, while rma.mv() isn't super fast, I am wondering why your (and Yogev's) running times are so long.
Best,
Wolfgang
-----Original Message-----
From: Martineau, Roger (AAFC/AAC) [mailto:roger.martineau using canada.ca]
Sent: Wednesday, 16 January, 2019 19:21
To: Viechtbauer, Wolfgang (SP)
Subject: [R-meta] Influential case diagnostics in a multivariate multilevel meta-analysis in metafor
Dear Wolfgang,
I have exactly the same problem as Dr. Kivity and have not been able to solve it yet due to the size of the data set I presume (n = 820). I have to let Cook’s distance run overnight and it is a real pain.
I checked the number of cores available (see below). Are they sufficient ?
> library(nat.utils)
> ncpus()
[1] 4
> library(parallel)
> detectCores(logical=FALSE)
[1] 2
This is one very frustrating issue with rma.mv, because I can fit a multilevel model using the lmer function (I know using rma.mv is more appropriate in a meta-analytic context) and will get Cook’s distance values a lot faster with the following:
> library(influence.ME)
> infl <- influence(NoMods, obs = TRUE)
> plot(infl, which = "cook")
> tmp.cook <- cooks.distance(infl)
> plot(infl, which = "cook")
> which(tmp.cook > 0.5)
[1] 642
Indeed, Cook’s distance values are not exactly the same using the rma.mv and the lmer function but large values should be detected using both functions.
Best regards,
Roger ☺
S.V.P. notez ma nouvelle adresse courriel ci-bas
Please note my new email address below
Roger Martineau, mv Ph.D.
Nutrition et Métabolisme des ruminants
Centre de recherche et de développement
sur le bovin laitier et le porc
Agriculture et agroalimentaire Canada/Agriculture and Agri-Food Canada
Téléphone/Telephone: 819-780-7319
Télécopieur/Facsimile: 819-564-5507
2000, Rue Collège / 2000, College Street
Sherbrooke (Québec) J1M 0C8
Canada
roger.martineau using canada.ca
Dear Yogev,
Since you use 'cluster=StudyID', cooks.distance() is doing 311 model fits. But you use 'reestimate=FALSE', which should speed things up a lot. Also, 'sparse=TRUE' probably makes a lot of sense here, since the marginal var-cov structure is probably quite sparse. So, for the most part, you are already using features that should help to speed things up.
But a few things:
1) You used 'cluster = StudyID', but unless you used attach(Data) or have 'StudyID' as a separate object in your workspace, this should not work. It should be 'cluster = Data$StudyID'.
2) If you use 'parallel="snow"', then no progress bar will be shown, so I wonder how you got the '6%' then. Or did you run this once without 'parallel="snow"'?
3) If you use 'parallel="snow"', then this won't give you any speed increase unless you actually make use of multiple cores. You can do this with the 'ncpus' argument. But first check how many cores you actually have available with parallel::detectCores() Note that this also counts 'logical' cores. If you are on MacOS or Windows, then detectCores(logical=FALSE) is a better indicator of how many cores to specify under 'ncpus'.
Best,
Wolfgang
>-----Original Message-----
>From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces using r-
>project.org] On Behalf Of Yogev Kivity
>Sent: Tuesday, 15 January, 2019 21:20
>To: r-sig-meta-analysis using r-project.org
>Subject: [R-meta] Influential case diagnostics in a multivariate
>multilevel meta-analysis in metafor
>
>Hi all,
>
>I am fitting a multivariate multilevel meta-analysis in metafor and
>having
>trouble computing outlier and influential case diagnostics (i.e., cook’s
>distances per
>https://wviechtb.github.io/metafor/reference/influence.rma.mv.html).
>
>This a large dataset of 3360 Pearson’s correlations (converted to
>Fisher’s
>z) nested within 600 subsamples that are nested within 311 studies. Below
>is the code I used for the model and for computing Cook’s distances, and
>the problem is that it takes it a lot of time to run (I ran it overnight
>and it only reached 6%). I am assuming it is related to the size of the
>dataset and to the complex model structure, but I am not sure how to go
>about and speed up the processing. I should note that I am computing the
>distances based on the simplest possible model (i.e., no moderators and
>without considering dependencies among effect sizes within clusters).
>
>I was hoping someone could help with some suggestions of how best to move
>forward.
>
>Thanks,
>Yogev
>
>NoMods <- rma.mv(yi, vi, random = ~ 1 | StudyID/GroupID/EffectSizeID,
>data=Data,sparse=TRUE)
>summary(NoMods)
>NoModsCooksDistance <- cooks.distance(NoMods,progbar = T,cluster =
>StudyID,
>reestimate=FALSE,parallel="snow")
>NoModsCooksDistance
>plot(NoModsCooksDistance, type="o", pch=19)
>
>--
>
>Yogev Kivity, Ph.D.
>Postdoctoral Fellow
>Department of Psychology
>The Pennsylvania State University
>Bruce V. Moore Building
>University Park, PA 16802
>Office Phone: (814) 867-2330
More information about the R-sig-meta-analysis
mailing list