[R-meta] possible miscalculation of Cook’s distances

Sat Feb 5 14:58:46 CET 2022

This is due to the way the estimates are weighted, which is a consequence of the variance components. In the full data, all estimates receive roughly equal weights:

weights(res.ml, type="rowsum")

On the other hand, when you remove experiments 1 or 2, the VCs change such that the weights also change noticeably:

weights(sub, type="rowsum")

Now, experiment 1 or 2 (whichever is still in the subset) receives more weight and so does experiment 3. So, removal of experiment 2 leads to more weight for experiment 1 which also has an extreme estimate, which affects the pooled estimate considerably. Hence, the large Cook's distance.

Best,
Wolfgang

>-----Original Message-----
>From: Antonina Dolgorukova [mailto:an.dolgorukova using gmail.com]
>Sent: Saturday, 05 February, 2022 9:55
>To: Viechtbauer, Wolfgang (SP)
>Cc: Reza Norouzian; R meta
>Subject: Re: [R-meta] possible miscalculation of Cook’s distances
>
>Dear Wolfgang,
>
>Thank you very much for confirming the calculations!! Tha's great that they are
>correct! Actually, by "miscalculation", I have rather meant the order, not the
>numbers.
>
>The thing that confuses me a lot is that according to forest plot examination and
>standardized (deleted) residuals, the 1st experiment is an outlier. But according
>to Cook’s distance plot, the 2nd experiment is influential. Looking at the forest
>plot, it is really hard to believe that the removal of the 2nd experiment can
>affect the pooled estimate (its mean is close to the overall effect estimate and
>CI does overlap (almost included) with CI of the overall effect).
>
>Sincerely,
>Antonina
>
>On Fri, Feb 4, 2022 at 7:17 PM Viechtbauer, Wolfgang (SP)
><wolfgang.viechtbauer using maastrichtuniversity.nl> wrote:
>Just as a brief follow-up with respect to the calculations: They are correct.
>Continuing with your code, Antonina, we can easily manually compute the Cook's
>distances here:
>
>cooksd[1:2]
>
>sub <- rma.mv(yi, vi,
>              random = ~ 1 | study/experiment,
>              data=dat, subset=-1)
>
>(coef(res.ml) - coef(sub))^2 / vcov(res.ml)
>
>sub <- rma.mv(yi, vi,
>              random = ~ 1 | study/experiment,
>              data=dat, subset=-2)
>
>(coef(res.ml) - coef(sub))^2 / vcov(res.ml)
>
>As you will see, these match up.
>
>With these data, removal of a single row can have a large impact on the variance
>components, which can lead to considerable changes in the pooled estimate.
>
>Best,
>Wolfgang
>
>>-----Original Message-----
>>From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces using r-project.org] On
>>Behalf Of Reza Norouzian
>>Sent: Friday, 04 February, 2022 8:01
>>To: Antonina Dolgorukova
>>Cc: R meta
>>Subject: Re: [R-meta] possible miscalculation of Cook’s distances
>>
>>Dear Antonina,
>>
>>There was no rationale, I just wanted to indicate that there is no bug
>>in the function (and save myself a tiny bit of time). The default
>>reestimate = TRUE is certainly to be preferred given that your model
>>does include random-effects and, as the documentation correctly
>>mentions, the influence of each effect estimate on the estimates of
>>heterogeneity and correlation can only be examined, if you set
>>reestimate = TRUE.
>>
>>For instance, if you remove the random-effects from your initial model
>>(res.ml), then, you'll see that the use of reestimate = TRUE or FALSE
>>has no effect. In both cases, only the effect estimate associated with
>>study 1, experiment 1 is influential. Thus, this shows that the use of
>>reestimate = FALSE in your initial model essentially ignored the
>>influence of each effect estimate on the estimates of heterogeneity.
>>
>>In models with complex random-effects structure esp. fit to large
>>datasets, setting reestimate = TRUE would take a good chunk of time.
>>In such cases, some (myself included) may be tempted to set reestimate
>>= FALSE hoping that the approximation will be close enough.
>>
>>Kind regards,
>>Reza
>>
>>res.ml2 <- update.rma(res.ml, random = NULL)
>>
>>cook_with_reest2 <- cooks.distance.rma.mv(res.ml2)
>>cook_without_reest2 <- cooks.distance.rma.mv(res.ml2, reestimate = FALSE)
>>
>>plot(cook_with_reest2,type="b")
>>plot(cook_without_reest2,type="b")
>>
>>On Fri, Feb 4, 2022 at 12:19 AM Antonina Dolgorukova
>><an.dolgorukova using gmail.com> wrote:
>>>
>>> Dear Reza,
>>>
>>> Thank you very much for your reply. I do understand that not all outliers have
>>to be influential and that not all influential cases have to be outliers. But
>>thank you for mentioning this and for the code provided. I did notice you set
>the
>>reestimate = FALSE. This raises two questions, actually. It would be great if
>you
>>help with them.
>>>
>>> 1) Is there any explanation why cooks.distance.rma.mv(res.ml) and
>>cooks.distance.rma.mv(res.ml , reestimate = FALSE) give completely different
>>results about the experiments 1 and 2 (and very similar for the remaining
>>experiments)?
>>>
>>> According to cooks.distance.rma.mv(res.ml), the Study 1, Experiment 2 is
>>influential, but
>>> according to  cooks.distance.rma.mv(res.ml , reestimate = FALSE), Study 1,
>>Experiment 1 is influential
>>>
>>>
>>> #the code illustrating my question
>>>
>>> dat <- data.frame(study=c(1,1,2,3,3,3), experiment=c(1:6),
>>>                   yi=c( 68, 18, 31,20,10,26),
>>>                   vi=c(60,32, 15, 19, 41, 82))
>>>
>>> res.ml <- rma.mv(yi, vi,
>>>                  random = ~ 1 | study/experiment,
>>>                  data=dat,
>>>                  slab = paste("Study ", study,", ", "Experiment ", experiment,
>>sep = ""))
>>>
>>> cook_with_reest <- cooks.distance.rma.mv(res.ml)
>>> cook_without_reest <- cooks.distance.rma.mv(res.ml, reestimate = FALSE)
>>>
>>> plot(1:length(cook_with_reest), cook_with_reest, ylim=c(0,2), type="o",
>pch=19,
>>las=1, xaxt="n",
>>>      xlab="Study and Experiment ID", ylab="Cook's Distance")
>>> points(cook_without_reest, type="o", pch=19, col = "red")
>>> axis(1, 1:length(cook_with_reest), labels=names(cook_with_reest))
>>> abline(h=4/length(cook_with_reest), lty="dotted", lwd=2)
>>> legend("top", pch=19, col=c("black","red"), lty="solid",
>>>        legend=c("reestimate = TRUE","reestimate = FALSE"), bty="n")
>>>
>>> 2) And could you please explain what is the rationale to set reestimate =
>>FALSE? According to the metafor documentation:
>>> "Doing so only yields an approximation to the Cook’s distances that ignores
>the
>>influence of the ith case on the variance/correlation components"
>>>
>>> Sincerely,
>>> Antonina
>>>
>>> On Fri, Feb 4, 2022 at 3:29 AM Reza Norouzian <rnorouzian using gmail.com> wrote:
>>>>
>>>> Dear Antonina,
>>>>
>>>> An effect estimate whose standardized deleted residual falls beyond
>>>> +/-1.96 doesn't necessarily need to have a cook's distance
>>>> (influence), and/or a hat value (if additional moderators are used)
>>>> that are likewise extreme.
>>>>
>>>> As a general proposition, it may be methodologically more reasonable
>>>> to simultaneously consult all these indices to flag an extreme effect
>>>> estimate.
>>>>
>>>> That said, in your case, it does seem that the estimate from study 1,
>>>> experiment 1 has a large standardized deleted residual as well as a
>>>> large Cook's distance relative to that of other effect estimates. So,
>>>> I don't think there is any bug in any of the metafor functions you
>>>> used.
>>>>
>>>> Below is a bit of code to catch that outlying effect estimate.
>>>>
>>>> Kind regards,
>>>> Reza
>>>>
>>>> dat <- data.frame(study=c(1,1,2,3,3,3), experiment=c(1:6),
>>>>                   yi=c( 68, 18, 31,20,10,26),
>>>>                   vi=c(60,32, 15, 19, 41, 82))
>>>>
>>>> res.ml <- rma.mv(yi, vi,
>>>>                  random = ~ 1 | study/experiment,
>>>>                  data=dat,
>>>>                  slab = paste("Study ", study,", ", "Experiment ",
>>>>                               experiment, sep = ""))
>>>>
>>>> dat <- transform(dat, obs = res.ml$slab)
>>>>
>>>> outlier_limit <- qnorm(.975)
>>>>
>>>> cook <- cooks.distance.rma.mv(res.ml,
>>>>                               reestimate = FALSE)
>>>>
>>>> st_del_res_z <- rstudent.rma.mv(res.ml,
>>>>                                 reestimate = FALSE)$z
>>>>
>>>> cook_limit <- max(mean(range(cook)), boxplot.stats(cook, coef =
>1.5)$stats[5])
>>>>
>>>> i <- abs(st_del_res_z) > outlier_limit
>>>> k <- cook > cook_limit
>>>> oo <- which(i & k)
>>>>
>>>> LL <- names(oo)
>>>>
>>>> removed <- subset(dat, obs %in% LL)
>>>> new_dat <- subset(dat, !obs %in% LL)
>>>>
>>>> On Thu, Feb 3, 2022 at 3:36 PM Antonina Dolgorukova
>>>> <an.dolgorukova using gmail.com> wrote:
>>>> >
>>>> > Dear Dr. Viechtbauer and all,
>>>> >
>>>> > I have a multilevel data structure (experiments nested within studies) are
>>>> > use rma.mv to calculate an overall effect estimate. The next step requires
>>>> > sensitivity analysis on the experiment-level data. For detecting outliers
>>>> > I've used standardized (deleted) residuals and for detecting influential
>>>> > experiments I've used Cook’s distance. However, the last test provides
>>>> > contradictory results.
>>>> >
>>>> > The forest plot indicates that the 1st experiment may be an outlier, and
>>>> > standardized (deleted) residuals confirm this. But according to Cook’s
>>>> > distance plot, the 2nd experiment is influential. It seems that there may
>>>> > be a miscalculation of Cook’s distance since I can easily reproduce this
>>>> > issue (in one study one of the experiments have to provide a much larger ES
>>>> > than the other) also if use the model with random = ~ 1 | experiment, the
>>>> > 1st experiment is influential, not the second.
>>>> >
>>>> > Could you please clarify is this a bug or a feature of the cooks.distance()
>>>> > function? Maybe it does not work properly with rma.mv objects?
>>>> >
>>>> >   ## Reproducible Example
>>>> >
>>>> > # data frame
>>>> >   dat <- data.frame(study=c(1,1,2,3,3,3), experiment=c(1:6),
>>>> >                     yi=c( 68, 18, 31,20,10,26),
>>>> >                     vi=c(60,32, 15, 19, 41, 82))
>>>> >
>>>> > # multilevel model
>>>> > res.ml <- rma.mv(yi, vi,
>>>> >                  random = ~ 1 | study/experiment,
>>>> >                  data=dat,
>>>> >                  slab = paste("Study ", study,", ", "Experiment ",
>>>> > experiment, sep = ""))
>>>> >
>>>> > # forest plot examination  indicates that the 1st experiment may be an
>>>> > outlier
>>>> > forest(res.ml,
>>>> >        header = "Study and Experiment ID")
>>>> >
>>>> > # standardized (deleted) residuals confirm this
>>>> > rst <- rstudent(res.ml)
>>>> > plot(NA, NA, xlim=c(1, res.ml$k), ylim=c(-3,5),
>>>> >      xlab="Study and Experiment ID", ylab="Standardized (Deleted)
>Residual",
>>>> >      xaxt="n", las=1)
>>>> > axis(side=1, at=1:res.ml$k, labels=rst$slab)
>>>> > abline(h=c(-1.96,1.96), lty="dotted")
>>>> > abline(h=0)
>>>> > points(1:res.ml$k, rst$z, type="o", pch=19)
>>>> >
>>>> > # according to Cook’s distance plot, the 2nd experiment is influential
>>>> > cooksd <- cooks.distance(res.ml)
>>>> > plot(1:length(cooksd), cooksd, ylim=c(0,2), type="o", pch=19, las=1,
>>>> > xaxt="n",
>>>> >      xlab="Experiment ID", ylab="Cook's Distance")
>>>> > axis(1, 1:length(cooksd), labels=names(cooksd))
>>>> > abline(h=4/length(cooksd), lty="dotted", lwd=2)
>>>> >
>>>> > sessionInfo()
>>>> > R version 4.1.2 (2021-11-01)
>>>> > Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>> > Running under: Windows 10 x64 (build 19042)
>>>> > [1] metafor_3.1-43 metadat_1.0-0  Matrix_1.3-4
>>>> >
>>>> > --
>>>> > Antonina Dolgorukova, M.D.
>>>> > Pavlov First Saint Petersburg State Medical University
>>>> > Lev Tolstoy str., 6-8, 197022
>>>> > Saint Petersburg, Russia