[R-meta] cooks.distance.rma.mv is slow on complex models

Tue Jul 25 19:54:23 CEST 2017

Hi Roger,

In order to compute Cook's distance for the ith estimate, we have to refit the model with the ith estimate removed. For simple models like those fit by lm(), there are shortcuts that we can use to avoid actually having to refit the model. See:

https://en.wikipedia.org/wiki/Cook%27s_distance

For more complex models, those shortcuts do not exist (at least, I am not aware of them). So, right now, cooks.distance() crunches through k model fits (k being the number of estimates), which can take a considerable amount of time for complex models and large datasets. Yes, that kind of sucks.

On my to-do list is to add parallel processing to cooks.distance(). profile() and gosh() are similar functions that do repeated model fits and those already can do parallel processing. My plan is to add this to cooks.distance(). At least then, you can cut down on the time if you have a computer with multiple cores.

In the meantime, you could give Microsoft R Open (MRO) a try. There is nothing fundamentally different about MRO, except that it directly comes with Intel's Math Kernel Library (MKL) and that is likely to speed up model fitting quite a bit. See:

http://www.metafor-project.org/doku.php/tips:speeding_up_model_fitting

Alternatively, you could consider computing a sort of 'pseudo Cook's distance' based on the same idea as how Cook's distance is computed for standard linear models. In essence, it's just some kind of squared standardized residual multiplied by h/(1-h)^2, where h is the hatvalue. A simple example:

dat <- get(data(dat.bcg))
dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
res <- rma(yi, vi, data=dat)

sav1 <- cooks.distance(res)
sav2 <- rstandard(res)$z^2 * hatvalues(res) / (1-hatvalues(res))^2

plot(sav1, sav2)
cor(sav1, sav2)

Not exactly the same thing, but quite close. A more complex example:

dat <- get(data(dat.konstantopoulos2011))
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)

sav1 <- cooks.distance(res)
sav2 <- rstandard(res)$z^2 * hatvalues(res) / (1-hatvalues(res))^2

plot(sav1, sav2)
cor(sav1, sav2)

Okay, now things start to not align so nicely anymore, but maybe still okay as a rough check.

Best,
Wolfgang

-- 
Wolfgang Viechtbauer, Ph.D., Statistician | Department of Psychiatry and    
Neuropsychology | Maastricht University | P.O. Box 616 (VIJV1) | 6200 MD    
Maastricht, The Netherlands | +31 (43) 388-4170 | http://www.wvbauer.com    

-----Original Message-----
From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces at r-project.org] On Behalf Of Martineau, Roger
Sent: Tuesday, July 25, 2017 18:56
To: r-sig-meta-analysis at r-project.org
Subject: [R-meta] cooks.distance.rma.mv is slow on complex models

Dear metafor users,

I am running complex models using rma.mv function of metafor (metafor_2.0-0; R version 3.4.0) on a large dataset (n = 197 studies) with a 4-level hierarchical structure of data.

#### Model ####
(tmp.casdiet <- rma.mv(MTPYMean, MTPYSEMtrDP^2, data=tmp.dat.MTPY.new,
                      mods = ~
                        cMPbal +
                        cNELbal +
                        cMPsupply.kg +
                        factor(CasDiet)*cMPsupply.kg,
                      random = ~1|laboratory/experiment/study,
                      method = "REML", sparse=TRUE))

All metafor functions run well except cooks.distance.rma.mv which is really slowing my work. I have a new computer that should be up to the task.

 The time spent to execute the task  is related to the random statement:
• random = ~1|laboratory/experiment/study : 14 min 5 sec
• random = ~1|experiment/study : 8 min 10 sec
• random = ~1|study: 1 min 12 sec

Find below more info on variance components and Cook’s distance graphs. 

In this model it wouldn’t make much of a difference to run the cooks.distance.rma.mv function with random = ~1|study but it is not the correct way to detect influential cases. 

Is there another way to speed up the process ?

Thanks in advance,

Roger Martineau ☺

The variance components and the associated Cook’s distance graphs are:

Variance Components: 

              estim     sqrt  nlvls  fixed
sigma^2.1  178.9385  13.3768     21     no
sigma^2.2  409.8262  20.2442     47     no
sigma^2.3    0.0000   0.0016     69     no
                                factor
sigma^2.1                   laboratory
sigma^2.2        laboratory/experiment
sigma^2.3  laboratory/experiment/study

#### Cook's distance ####
par(mfrow=c(1,1))
tmp.cook <- cooks.distance.rma.mv(tmp.casdiet, progbar=TRUE)
plot(tmp.cook, type="o", pch=19)
which(tmp.cook > 1)

• random = ~1|laboratory/experiment/study

• random = ~1|experiment/study

• random = ~1|study