[R-meta] Influential case diagnostics in a multivariate multilevel meta-analysis in metafor

Fri Jan 18 16:15:40 CET 2019

Hi Wolfgang,

Using the latest 'devel' version of metafor worked! The computation took
about 10 minutes to run on 4 parallel cores (the number of cores was indeed
determined using the 'parallel' package).

Thanks for all your help!
Yogev

--

Yogev Kivity, Ph.D.
Postdoctoral Fellow
Department of Psychology
The Pennsylvania State University
Bruce V. Moore Building
University Park, PA 16802
Office Phone: (814) 867-2330

On Thu, Jan 17, 2019 at 5:16 PM Viechtbauer, Wolfgang (SP) <
wolfgang.viechtbauer using maastrichtuniversity.nl> wrote:

> Hi Yogev,
>
> Just to be safe, make sure you are using the latest 'devel' version of
> metafor. Run devtools::install_github("wviechtb/metafor") to be sure. Also,
> I would go with whatever detectCores(logical=FALSE) tells you for the
> number of cores. But even without that, things should finish in a few
> minutes. Beyond that, I really don't know what the issue could be. It
> certainly isn't an issue with metafor per se.
>
> Best,
> Wolfgang
>
> -----Original Message-----
> From: Yogev Kivity [mailto:yogev_k using yahoo.com]
> Sent: Thursday, 17 January, 2019 21:37
> To: Viechtbauer, Wolfgang (SP)
> Cc: Martineau, Roger (AAFC/AAC); R-sig-meta-analysis using r-project.org
> Subject: Re: [R-meta] Influential case diagnostics in a multivariate
> multilevel meta-analysis in metafor
>
> Hi Wolfgang,
>
> Thanks for your detailed reply and suggestions. Unfortunately, even after
> implementing your suggestions, the computation still had not finished after
> running overnight (with 4 logical cores).
>
> I was going to suggest that the imbalance in my dataset, compared with the
> konstantopoulos2011 data, might have something to do with it (cluster sizes
> in my dataset range from 1 to 234 effect sizes, with a mean of 11 and a
> median of 5). However, when I ran the konstantopoulos2011 code, I got
> similar running times for fitting the models (using standard BLAS), but the
> Cook’s distances computation did not terminate even after 2050 seconds,
> even with parallel processing on 4 logical cores. I used this code:
>
> system.time(sav2 <- cooks.distance(res2, cluster=dat$group,
> reestimate=FALSE, parallel="snow", ncpus=4))
>
> Any thoughts?
>
> Thanks,
> Yogev
>
> On Thu, Jan 17, 2019 at 4:24 AM Viechtbauer, Wolfgang (SP) <
> wolfgang.viechtbauer using maastrichtuniversity.nl> wrote:
> Please keep the mailing list in cc.
>
> I don't know what model you are fitting, but with k=820, that running time
> seems excessive. Here is an artificial example with k=2800. I just use the
> data from 'dat.konstantopoulos2011' and replicate them 50 times to create a
> much larger dataset. I then fit a multilevel model with group
> (replication), district, and school as random effects. First, I use the
> defaults and then sparse=TRUE, since that should help quite a bit here.
> Also, I ran things once with the standard BLAS routines and once with
> OpenBLAS (switching those routines requires making system changes; it is
> not something that can be done within R).
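
A hedged aside for readers who want to check which BLAS/LAPACK their R
session is linked against before comparing timings: the sketch uses only
base R (La_library() and extSoftVersion(), which to my knowledge are
available from R 3.4.0 onward). It only reports what is in use and does not
switch anything; the variable names are illustrative.

```r
# Report the linear algebra libraries this R session is using.
# This only inspects; switching BLAS still has to happen at the system level.
lapack_path <- La_library()              # path to the LAPACK library in use ("" if unknown)
blas_path   <- extSoftVersion()["BLAS"]  # path to the BLAS library in use, if known

cat("LAPACK:", lapack_path, "\n")
cat("BLAS:  ", blas_path, "\n")
```

If the reported path mentions a reference BLAS (e.g. a file named libRblas),
the slower timings in the comparison are the relevant ones.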
>
> ###########################
>
> library(metafor)
>
> dat <- dat.konstantopoulos2011
> group <- rep(1:nrow(dat), each=50)
> dat <- dat[group,]
> dat$group <- group
> rm(group)
> nrow(dat)
>
> system.time(res1 <- rma.mv(yi, vi, random = ~ 1 | group/district/school,
> data=dat))
>
> system.time(res2 <- rma.mv(yi, vi, random = ~ 1 | group/district/school,
> data=dat, sparse=TRUE))
>
> system.time(sav1 <- cooks.distance(res2, cluster=dat$group,
> reestimate=FALSE))
>
> ###### results:
>
> ### with standard BLAS
>
> > system.time(res1 <- rma.mv(yi, vi, random = ~ 1 |
> group/district/school, data=dat))
>    user  system elapsed
> 683.587   8.712 692.312
> >
> > system.time(res2 <- rma.mv(yi, vi, random = ~ 1 |
> group/district/school, data=dat, sparse=TRUE))
>    user  system elapsed
>   8.292   0.600   8.894
>
> > system.time(sav1 <- cooks.distance(res2, cluster=dat$group,
> reestimate=FALSE))
>    user  system elapsed
> 270.960   0.044 271.005
>
> ### with OpenBLAS
>
> > system.time(res1 <- rma.mv(yi, vi, random = ~ 1 |
> group/district/school, data=dat))
>    user  system elapsed
>  86.531   8.707  95.242
> >
> > system.time(res2 <- rma.mv(yi, vi, random = ~ 1 |
> group/district/school, data=dat, sparse=TRUE))
>    user  system elapsed
>   6.476   0.632   7.108
>
> > system.time(sav1 <- cooks.distance(res2, cluster=dat$group,
> reestimate=FALSE))
>    user  system elapsed
> 148.071   0.060 148.133
>
> ###########################
>
> So, with the defaults and standard BLAS, fitting that model takes about 11.5
> minutes, which is a bit painful (especially if you then also want to compute
> the Cook's distances). Using sparse=TRUE brings this down to 9 seconds.
> Computing the 'group' level Cook's distances (using reestimate=FALSE, so
> really they are approximations, but usually good enough for diagnostic
> purposes) takes 4.5 minutes, which does require you to grab a cup of coffee
> and have a quick chat with a colleague at the coffee machine, but that isn't
> such a bad thing.
>
> Switching to OpenBLAS helps especially when using the defaults (fitting now
> takes about 1.5 minutes). Using sparse=TRUE brings the time down to 7
> seconds and the Cook's distances are then computed in about 2.5 minutes.
> That only leaves time to grab coffee and say hi to your colleague.
>
> I did not use any multicore processing here, so if you use 2 cores, you
> can pretty much halve the time to compute the Cook's distances (there is a
> bit of overhead when using multicore processing, but that should be minor
> here).
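
As a quick check of the arithmetic in this message, the following sketch
converts the elapsed times reported by system.time() into minutes and
computes an idealized two-core time for the Cook's distances (simply
elapsed/2, ignoring the overhead mentioned). The numbers are copied from the
output quoted in this message; the object names are made up for illustration.

```r
# Elapsed times (seconds) copied from the system.time() output in this message
elapsed <- rbind(
  standard_blas = c(fit_default = 692.312, fit_sparse = 8.894, cooks = 271.005),
  openblas      = c(fit_default = 95.242,  fit_sparse = 7.108, cooks = 148.133)
)

# Same numbers in minutes, rounded to one decimal
minutes <- round(elapsed / 60, 1)
print(minutes)

# Idealized time (seconds) for the Cook's distances on 2 cores, no overhead assumed
cooks_two_cores <- elapsed[, "cooks"] / 2
print(round(cooks_two_cores, 1))
```

The real two-core time will be somewhat higher than this estimate because of
the setup cost of the worker processes.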
>
> So, while rma.mv() isn't super fast, I am wondering why your (and
> Yogev's) running times are so long.
>
> Best,
> Wolfgang
>
> -----Original Message-----
> From: Martineau, Roger (AAFC/AAC) [mailto:roger.martineau using canada.ca]
> Sent: Wednesday, 16 January, 2019 19:21
> To: Viechtbauer, Wolfgang (SP)
> Subject: [R-meta] Influential case diagnostics in a multivariate
> multilevel meta-analysis in metafor
>
> Dear Wolfgang,
>
> I have exactly the same problem as Dr. Kivity and have not been able to
> solve it yet, presumably because of the size of my data set (n = 820). I
> have to let the Cook’s distance computation run overnight, and it is a real
> pain.
>
> I checked the number of cores available (see below). Is that sufficient?
>
> > library(nat.utils)
> > ncpus()
> [1] 4
>
> > library(parallel)
> > detectCores(logical=FALSE)
> [1] 2
>
> This is one very frustrating issue with rma.mv, because I can fit a
> multilevel model with the lmer function (I know that rma.mv is more
> appropriate in a meta-analytic context) and get Cook’s distance values a
> lot faster with the following:
>
> > library(influence.ME)
> > infl <- influence(NoMods, obs = TRUE)
> > plot(infl, which = "cook")
> > tmp.cook <- cooks.distance(infl)
> > plot(infl, which = "cook")
> > which(tmp.cook > 0.5)
> [1] 642
>
> Admittedly, the Cook’s distance values from rma.mv and lmer are not exactly
> the same, but large values should be detected by both functions.
>
> Best regards,
>
> Roger ☺
>
> Please note my new email address below
>
> Roger Martineau, mv Ph.D.
> Nutrition et Métabolisme des ruminants
> Centre de recherche et de développement
> sur le bovin laitier et le porc
> Agriculture et agroalimentaire Canada/Agriculture and Agri-Food Canada
> Téléphone/Telephone: 819-780-7319
> Télécopieur/Facsimile: 819-564-5507
> 2000, Rue Collège / 2000, College Street
> Sherbrooke (Québec)  J1M 0C8
> Canada
> roger.martineau using canada.ca
>
> Dear Yogev,
>
> Since you use 'cluster=StudyID', cooks.distance() is doing 311 model fits.
> But you use 'reestimate=FALSE', which should speed things up a lot. Also,
> 'sparse=TRUE' probably makes a lot of sense here, since the marginal
> var-cov structure is probably quite sparse. So, for the most part, you are
> already using features that should help to speed things up.
>
> But a few things:
>
> 1) You used 'cluster = StudyID', but unless you used attach(Data) or have
> 'StudyID' as a separate object in your workspace, this should not work. It
> should be 'cluster = Data$StudyID'.
>
> 2) If you use 'parallel="snow"', then no progress bar will be shown, so I
> wonder how you got the '6%' then. Or did you run this once without
> 'parallel="snow"'?
>
> 3) If you use 'parallel="snow"', then this won't give you any speed
> increase unless you actually make use of multiple cores. You can do this
> with the 'ncpus' argument. But first check how many cores you actually have
> available with parallel::detectCores(). Note that this also counts
> 'logical' cores. If you are on macOS or Windows, then
> detectCores(logical=FALSE) is a better indicator of how many cores to
> specify under 'ncpus'.
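
A minimal sketch of point 3, assuming only the 'parallel' package that ships
with R. detectCores() can return NA when the count cannot be determined, so
the sketch falls back to 1 in that case. The commented-out call reuses the
object names from the original post (NoMods, Data$StudyID) and is shown for
illustration only.

```r
library(parallel)

n_logical  <- detectCores()                 # counts logical cores as well
n_physical <- detectCores(logical = FALSE)  # physical cores, where the platform supports it

# Prefer the physical count; fall back to 1 if it is unknown (NA)
ncpus <- if (is.na(n_physical)) 1L else as.integer(n_physical)
cat("logical:", n_logical, " physical:", n_physical, " using ncpus =", ncpus, "\n")

# Illustration only, with the objects from the original post:
# cooks.distance(NoMods, cluster = Data$StudyID, reestimate = FALSE,
#                parallel = "snow", ncpus = ncpus)
```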
>
> Best,
> Wolfgang
>
> >-----Original Message-----
> >From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces using r-
> >project.org] On Behalf Of Yogev Kivity
> >Sent: Tuesday, 15 January, 2019 21:20
> >To: r-sig-meta-analysis using r-project.org
> >Subject: [R-meta] Influential case diagnostics in a multivariate
> >multilevel meta-analysis in metafor
> >
> >Hi all,
> >
> >I am fitting a multivariate multilevel meta-analysis in metafor and
> >having trouble computing outlier and influential case diagnostics (i.e.,
> >Cook’s distances per
> >https://wviechtb.github.io/metafor/reference/influence.rma.mv.html).
> >
> >This is a large dataset of 3360 Pearson’s correlations (converted to
> >Fisher’s z) nested within 600 subsamples that are nested within 311
> >studies. Below is the code I used for the model and for computing Cook’s
> >distances. The problem is that it takes a very long time to run (I ran it
> >overnight and it only reached 6%). I am assuming this is related to the
> >size of the dataset and to the complex model structure, but I am not sure
> >how to go about speeding up the processing. I should note that I am
> >computing the distances based on the simplest possible model (i.e., no
> >moderators and without considering dependencies among effect sizes within
> >clusters).
> >
> >I was hoping someone could help with some suggestions of how best to move
> >forward.
> >
> >Thanks,
> >Yogev
> >
> >NoMods <- rma.mv(yi, vi, random = ~ 1 | StudyID/GroupID/EffectSizeID,
> >data=Data,sparse=TRUE)
> >summary(NoMods)
> >NoModsCooksDistance <- cooks.distance(NoMods,progbar = T,cluster =
> >StudyID,
> >reestimate=FALSE,parallel="snow")
> >NoModsCooksDistance
> >plot(NoModsCooksDistance, type="o", pch=19)
> >
>
