[R-meta] possible miscalculation of Cook’s distances
Viechtbauer, Wolfgang (SP)
wo||g@ng@v|echtb@uer @end|ng |rom m@@@tr|chtun|ver@|ty@n|
Fri Feb 4 16:16:51 CET 2022
Just as a brief follow-up with respect to the calculations: They are correct. Continuing with your code, Antonina, we can easily manually compute the Cook's distances here:
cooksd[1:2]
sub <- rma.mv(yi, vi,
random = ~ 1 | study/experiment,
data=dat, subset=-1)
(coef(res.ml) - coef(sub))^2 / vcov(res.ml)
sub <- rma.mv(yi, vi,
random = ~ 1 | study/experiment,
data=dat, subset=-2)
(coef(res.ml) - coef(sub))^2 / vcov(res.ml)
As you will see, these match up.
With these data, removal of a single row can have a large impact on the variance components, which can lead to considerable changes in the pooled estimate.
Best,
Wolfgang
>-----Original Message-----
>From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces using r-project.org] On
>Behalf Of Reza Norouzian
>Sent: Friday, 04 February, 2022 8:01
>To: Antonina Dolgorukova
>Cc: R meta
>Subject: Re: [R-meta] possible miscalculation of Cook’s distances
>
>Dear Antonina,
>
>There was no rationale, I just wanted to indicate that there is no bug
>in the function (and save myself a tiny bit of time). The default
>reestimate = TRUE is certainly to be preferred given that your model
>does include random-effects and, as the documentation correctly
>mentions, the influence of each effect estimate on the estimates of
>heterogeneity and correlation can only be examined, if you set
>reestimate = TRUE.
>
>For instance, if you remove the random-effects from your initial model
>(res.ml), then, you'll see that the use of reestimate = TRUE or FALSE
>has no effect. In both cases, only the effect estimate associated with
>study 1, experiment 1 is influential. Thus, this shows that the use of
>reestimate = FALSE in your initial model essentially ignored the
>influence of each effect estimate on the estimates of heterogeneity.
>
>In models with complex random-effects structure esp. fit to large
>datasets, setting reestimate = TRUE would take a good chunk of time.
>In such cases, some (myself included) may be tempted to set reestimate
>= FALSE hoping that the approximation will be close enough.
>
>Kind regards,
>Reza
>
>res.ml2 <- update.rma(res.ml, random = NULL)
>
>cook_with_reest2 <- cooks.distance.rma.mv(res.ml2)
>cook_without_reest2 <- cooks.distance.rma.mv(res.ml2, reestimate = FALSE)
>
>plot(cook_with_reest2,type="b")
>plot(cook_without_reest2,type="b")
>
>On Fri, Feb 4, 2022 at 12:19 AM Antonina Dolgorukova
><an.dolgorukova using gmail.com> wrote:
>>
>> Dear Reza,
>>
>> Thank you very much for your reply. I do understand that not all outliers have
>to be influential and that not all influential cases have to be outliers. But
>thank you for mentioning this and for the code provided. I did notice you set the
>reestimate = FALSE. This raises two questions, actually. It would be great if you
>help with them.
>>
>> 1) Is there any explanation why cooks.distance.rma.mv(res.ml) and
>cooks.distance.rma.mv(res.ml , reestimate = FALSE) give completely different
>results about the experiments 1 and 2 (and very similar for the remaining
>experiments)?
>>
>> According to cooks.distance.rma.mv(res.ml), the Study 1, Experiment 2 is
>influential, but
>> according to cooks.distance.rma.mv(res.ml , reestimate = FALSE), Study 1,
>Experiment 1 is influential
>>
>>
>> #the code illustrating my question
>>
>> dat <- data.frame(study=c(1,1,2,3,3,3), experiment=c(1:6),
>> yi=c( 68, 18, 31,20,10,26),
>> vi=c(60,32, 15, 19, 41, 82))
>>
>> res.ml <- rma.mv(yi, vi,
>> random = ~ 1 | study/experiment,
>> data=dat,
>> slab = paste("Study ", study,", ", "Experiment ", experiment,
>sep = ""))
>>
>> cook_with_reest <- cooks.distance.rma.mv(res.ml)
>> cook_without_reest <- cooks.distance.rma.mv(res.ml, reestimate = FALSE)
>>
>> plot(1:length(cook_with_reest), cook_with_reest, ylim=c(0,2), type="o", pch=19,
>las=1, xaxt="n",
>> xlab="Study and Experiment ID", ylab="Cook's Distance")
>> points(cook_without_reest, type="o", pch=19, col = "red")
>> axis(1, 1:length(cook_with_reest), labels=names(cook_with_reest))
>> abline(h=4/length(cook_with_reest), lty="dotted", lwd=2)
>> legend("top", pch=19, col=c("black","red"), lty="solid",
>> legend=c("reestimate = TRUE","reestimate = FALSE"), bty="n")
>>
>> 2) And could you please explain what is the rationale to set reestimate =
>FALSE? According to the metafor documentation:
>> "Doing so only yields an approximation to the Cook’s distances that ignores the
>influence of the ith case on the variance/correlation components"
>>
>> Sincerely,
>> Antonina
>>
>> On Fri, Feb 4, 2022 at 3:29 AM Reza Norouzian <rnorouzian using gmail.com> wrote:
>>>
>>> Dear Antonina,
>>>
>>> An effect estimate whose standardized deleted residual falls beyond
>>> +/-1.96 doesn't necessarily need to have a cook's distance
>>> (influence), and/or a hat value (if additional moderators are used)
>>> that are likewise extreme.
>>>
>>> As a general proposition, it may be methodologically more reasonable
>>> to simultaneously consult all these indices to flag an extreme effect
>>> estimate.
>>>
>>> That said, in your case, it does seem that the estimate from study 1,
>>> experiment 1 has a large standardized deleted residual as well as a
>>> large Cook's distance relative to that of other effect estimates. So,
>>> I don't think there is any bug in any of the metafor functions you
>>> used.
>>>
>>> Below is a bit of code to catch that outlying effect estimate.
>>>
>>> Kind regards,
>>> Reza
>>>
>>> dat <- data.frame(study=c(1,1,2,3,3,3), experiment=c(1:6),
>>> yi=c( 68, 18, 31,20,10,26),
>>> vi=c(60,32, 15, 19, 41, 82))
>>>
>>> res.ml <- rma.mv(yi, vi,
>>> random = ~ 1 | study/experiment,
>>> data=dat,
>>> slab = paste("Study ", study,", ", "Experiment ",
>>> experiment, sep = ""))
>>>
>>> dat <- transform(dat, obs = res.ml$slab)
>>>
>>> outlier_limit <- qnorm(.975)
>>>
>>> cook <- cooks.distance.rma.mv(res.ml,
>>> reestimate = FALSE)
>>>
>>> st_del_res_z <- rstudent.rma.mv(res.ml,
>>> reestimate = FALSE)$z
>>>
>>> cook_limit <- max(mean(range(cook)), boxplot.stats(cook, coef = 1.5)$stats[5])
>>>
>>> i <- abs(st_del_res_z) > outlier_limit
>>> k <- cook > cook_limit
>>> oo <- which(i & k)
>>>
>>> LL <- names(oo)
>>>
>>> removed <- subset(dat, obs %in% LL)
>>> new_dat <- subset(dat, !obs %in% LL)
>>>
>>> On Thu, Feb 3, 2022 at 3:36 PM Antonina Dolgorukova
>>> <an.dolgorukova using gmail.com> wrote:
>>> >
>>> > Dear Dr. Viechtbauer and all,
>>> >
>>> > I have a multilevel data structure (experiments nested within studies) are
>>> > use rma.mv to calculate an overall effect estimate. The next step requires
>>> > sensitivity analysis on the experiment-level data. For detecting outliers
>>> > I've used standardized (deleted) residuals and for detecting influential
>>> > experiments I've used Cook’s distance. However, the last test provides
>>> > contradictory results.
>>> >
>>> > The forest plot indicates that the 1st experiment may be an outlier, and
>>> > standardized (deleted) residuals confirm this. But according to Cook’s
>>> > distance plot, the 2nd experiment is influential. It seems that there may
>>> > be a miscalculation of Cook’s distance since I can easily reproduce this
>>> > issue (in one study one of the experiments have to provide a much larger ES
>>> > than the other) also if use the model with random = ~ 1 | experiment, the
>>> > 1st experiment is influential, not the second.
>>> >
>>> > Could you please clarify is this a bug or a feature of the cooks.distance()
>>> > function? Maybe it does not work properly with rma.mv objects?
>>> >
>>> > ## Reproducible Example
>>> >
>>> > # data frame
>>> > dat <- data.frame(study=c(1,1,2,3,3,3), experiment=c(1:6),
>>> > yi=c( 68, 18, 31,20,10,26),
>>> > vi=c(60,32, 15, 19, 41, 82))
>>> >
>>> > # multilevel model
>>> > res.ml <- rma.mv(yi, vi,
>>> > random = ~ 1 | study/experiment,
>>> > data=dat,
>>> > slab = paste("Study ", study,", ", "Experiment ",
>>> > experiment, sep = ""))
>>> >
>>> > # forest plot examination indicates that the 1st experiment may be an
>>> > outlier
>>> > forest(res.ml,
>>> > header = "Study and Experiment ID")
>>> >
>>> > # standardized (deleted) residuals confirm this
>>> > rst <- rstudent(res.ml)
>>> > plot(NA, NA, xlim=c(1, res.ml$k), ylim=c(-3,5),
>>> > xlab="Study and Experiment ID", ylab="Standardized (Deleted) Residual",
>>> > xaxt="n", las=1)
>>> > axis(side=1, at=1:res.ml$k, labels=rst$slab)
>>> > abline(h=c(-1.96,1.96), lty="dotted")
>>> > abline(h=0)
>>> > points(1:res.ml$k, rst$z, type="o", pch=19)
>>> >
>>> > # according to Cook’s distance plot, the 2nd experiment is influential
>>> > cooksd <- cooks.distance(res.ml)
>>> > plot(1:length(cooksd), cooksd, ylim=c(0,2), type="o", pch=19, las=1,
>>> > xaxt="n",
>>> > xlab="Experiment ID", ylab="Cook's Distance")
>>> > axis(1, 1:length(cooksd), labels=names(cooksd))
>>> > abline(h=4/length(cooksd), lty="dotted", lwd=2)
>>> >
>>> > sessionInfo()
>>> > R version 4.1.2 (2021-11-01)
>>> > Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> > Running under: Windows 10 x64 (build 19042)
>>> > [1] metafor_3.1-43 metadat_1.0-0 Matrix_1.3-4
>>> >
>>> > --
>>> > Antonina Dolgorukova, M.D.
>>> > Pavlov First Saint Petersburg State Medical University
>>> > Lev Tolstoy str., 6-8, 197022
>>> > Saint Petersburg, Russia
More information about the R-sig-meta-analysis
mailing list