[R-meta] Wald_test - is it powerful enough?

James Pustejovsky jepusto at gmail.com
Fri Sep 3 15:39:42 CEST 2021


Hi Catia,

Responses below.

James

On Fri, Sep 3, 2021 at 2:00 AM Cátia Ferreira De Oliveira <
cmfo500 using york.ac.uk> wrote:

> Dear Wolfgang and James,
>
> Thank you for your responses. I very much appreciate the fact that you
> both spent so much time on the responses!
> Just some follow-up questions, if that's ok. Would you still report the QM
> tests or just the models corrected for misspecification? I wonder whether
> it would be reasonable to refer to the patterns that potentially exist in
> the dataset whilst making a note about the possibility that they are a
> fluke.
>

I agree with Michael that it's good to report sensitivity analyses for
reasonable alternative approaches to analyzing the data. Here, however, I
think it depends a bit on what your overall analytic approach is. If the
rest of your analysis is all based on RVE inference, then I don't think it
would make as much sense to report the model-based QM test results just for
one particular hypothesis.


> Also, a very simple question but I am wondering what would be the best way
> of coding for multiple experiments published within one study, all with
> completely different participants. Would you code it as all belonging to
> the same study (same identifier) with only the specific variables varying
> (e.g. age and other information of relevance) or would you identify it as
> being separate studies e.g. (X, 2012a; X, 2012b...)? I am sorry if this is
> a really basic question!
>

As Michael noted, this can be understood as an additional level of nesting.
One approach to handling this is to include an additional level of random
effects in the (working) model: random effects for each unique study ID,
random effects for each unique experiment ID (nested within study), and
possibly random effects for each unique outcome/effect size. Depending on
how many studies include multiple independent experiments, it might not be
possible to estimate this model very well and you'd have to simplify (e.g.,
by dropping the middle level of random effects per experiment).

One thing to note with this set-up: when creating the V-matrix, you should
use the experiment ID as the clustering variable because the sampling
errors of the effect size estimates are only correlated if they're based on
the same (or overlapping) sample of participants. But then when using RVE
for SEs/hypothesis tests/CIs, you should cluster on the top-level study
IDs. clubSandwich will cluster by the top level of random effects by
default, so you can do this just by omitting the cluster = argument when
using vcovCR(), coef_test(), Wald_test(), or conf_int().
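
In code, a sketch of those two steps with clubSandwich (same placeholder
column names as above):

library(clubSandwich)

# cluster the V matrix by experiment ID: sampling errors are correlated
# only for estimates based on the same (or overlapping) sample
V <- impute_covariance_matrix(dat$vi, cluster = dat$experiment, r = 0.6)

# for RVE, cluster at the study level; omitting the cluster argument lets
# clubSandwich default to the top level of the random effects (study)
coef_test(res, vcov = "CR2")
Wald_test(res, constraints = constrain_zero(1:3), vcov = "CR2")
conf_int(res, vcov = "CR2")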



>
> Thank you for your help!
>
> Best wishes,
>
> Catia
>
> On Thu, 2 Sept 2021 at 15:31, James Pustejovsky <jepusto using gmail.com> wrote:
>
>> Catia,
>>
>> I'll add a few observations to Wolfgang's points. I agree that it doesn't
>> really make sense to compare the model-based QM test to the robust HTZ test
>> because they don't necessarily have the same Type I error rates. As
>> Wolfgang noted, the QM test will only maintain Type I error if you've
>> correctly modeled the correlations and random effects structure (which
>> seems implausible given that r = 0.6 is a very rough, arbitrary
>> assumption). A further issue is that the QM test also relies on
>> large-sample approximation (because it uses a chi-squared reference
>> distribution) and so requires a sufficiently large number of studies to
>> provide calibrated Type I error. With a small number of studies, it will
>> tend to have overly generous Type I error (yielding p-values that are
>> smaller than they should be). So two strikes against it.
>>
>> The cluster-robust HTZ test provides a level of insurance against model
>> mis-specification. It also uses small-sample adjustments so that it should
>> properly control Type I error rates even when based on a fairly small
>> number of studies. However, it does still entail a degree of approximation
>> and will not necessarily have *exactly* calibrated Type I error rates. In
>> Tipton & Pustejovsky (2015, cited in Wolfgang's response), we found that
>> the HTZ test properly controls Type I error rates, meaning that error rates
>> were always at or below the nominal level. But we also observed that HTZ
>> can be conservative, in the sense that it sometimes has Type I error rates
>> substantially below the nominal level (such as .01 when alpha = .05). This
>> suggests that the test can sometimes have very limited power. It seems
>> you've identified one such situation.
>>
>> As Wolfgang noted, the denominator degrees of freedom of the robust HTZ
>> test are very small here, which indicates a scenario where the HTZ might
>> have problems. For this particular example, I expect that it is because
>> effect sizes of type "covert" occur only in a single study and effects of
>> type "overt" occur only in a few:
>>
>> library(dplyr)
>> dat %>%
>>   group_by(deltype) %>%
>>   summarise(
>>     studies = n_distinct(study),
>>     effects = n()
>>   )
>>
>> # A tibble: 3 x 3
>>   deltype studies effects
>>   <chr>     <int>   <int>
>> 1 covert        1       9
>> 2 general      17      78
>> 3 overt         3      13
>>
>> This is a situation where RVE is not going to work well (if at all)
>> because RVE is based on only between-study variation in effect sizes.
>> Another way to check this is to look at the Satterthwaite degrees of
>> freedom of the individual coefficients you are testing against zero:
>>
>> conf_int(res, vcov = "CR2")
>>
>>             Coef Estimate     SE  d.f. Lower 95% CI Upper 95% CI
>> 1  deltypecovert   -0.290 0.0823  4.20       -0.515      -0.0658
>> 2 deltypegeneral    0.416 0.0997 14.76        0.203       0.6288
>> 3   deltypeovert    0.160 0.0860  2.67       -0.134       0.4539
>>
>> As you can see, the first and the third coefficients have very few
>> degrees of freedom, so the uncertainty around them will be less well
>> quantified.
>>
>> In situations like this, I think it is advisable to limit the tests to
>> coefficients for which at least some minimum number of studies have effect
>> size estimates (e.g., at least 4 studies). Applying that rule here would
>> mean limiting the test to only deltype = "general":
>>
>> Wald_test(res, constraints = constrain_zero(2), vcov = "CR2")
>>
>> test Fstat df_num df_denom  p_val sig
>>   HTZ  17.4      1     14.8 <0.001 ***
>>
>> or equivalently:
>> coef_test(res, vcov = "CR2", coefs = 2)
>>
>>            Coef. Estimate     SE t-stat d.f. p-val (Satt) Sig.
>> 1 deltypegeneral    0.416 0.0997   4.17 14.8       <0.001  ***
>>
>> James
>>
>>
>> On Thu, Sep 2, 2021 at 5:07 AM Viechtbauer, Wolfgang (SP) <
>> wolfgang.viechtbauer using maastrichtuniversity.nl> wrote:
>>
>>> Dear Cátia,
>>>
>>> A comparison of power is only really appropriate if the two tests would
>>> have the same Type I error rate. I can always create a test that
>>> outperforms all other tests in terms of power by *always* rejecting, but
>>> then my test also has a 100% Type I error rate, so it is useless.
>>>
>>> So, whether the cluster-robust Wald test (i.e., Wald_test()) and the
>>> standard Wald-type QM test differ in power is a futile question unless we
>>> know that both tests control the Type I error rate. This is impossible to
>>> say in general - it depends on many factors.
>>>
>>> In this example, you are using an approximate V matrix and fitting a
>>> multilevel model (using the multivariate parameterization). That might be a
>>> reasonable working model, although the V matrix is just a very rough
>>> approximation (one would have to look at the details of all articles to see
>>> what kind of dependencies there are between the estimates within studies)
>>> and r=0.6 might or might not be reasonable.
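>>>
>>> (A rough way to probe that last assumption, sketched here with the example
>>> data quoted below, is to refit the model over a grid of r values and check
>>> whether the robust test moves much:
>>>
>>> sapply(c(0.2, 0.4, 0.6, 0.8), function(r) {
>>>   V_r <- impute_covariance_matrix(dat$vi, cluster = dat$study, r = r)
>>>   fit <- rma.mv(yi, V_r, mods = ~ deltype - 1,
>>>                 random = ~ factor(esid) | study, data = dat)
>>>   Wald_test(fit, constraints = constrain_zero(1:3), vcov = "CR2",
>>>             cluster = dat$study)$p_val
>>> })
>>>
>>> If the conclusions are stable across that range, the exact choice of r is
>>> less of a concern.)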
>>>
>>> So using cluster-robust inference is a sensible further step as an
>>> additional 'safeguard', although there are 'only' 17 studies.
>>> Cluster-robust inference methods work asymptotically, so as the number of
>>> studies goes to infinity. How 'close to infinity' we have to be before we
>>> can trust the cluster-robust inferences is another difficult question that
>>> is impossible to answer in general. These articles should provide some
>>> discussion around this:
>>>
>>> Tanner-Smith, E. E., & Tipton, E. (2014). Robust variance estimation
>>> with dependent effect sizes: Practical considerations including a software
>>> tutorial in Stata and SPSS. Research Synthesis Methods, 5(1), 13-30.
>>> https://doi.org/10.1002/jrsm.1091
>>>
>>> Tipton, E., & Pustejovsky, J. E. (2015). Small-sample adjustments for
>>> tests of moderators and model fit using robust variance estimation in
>>> meta-regression. Journal of Educational and Behavioral Statistics, 40(6),
>>> 604-634. https://doi.org/10.3102/1076998615606099
>>>
>>> Tipton, E. (2015). Small sample adjustments for robust variance
>>> estimation with meta-regression. Psychological Methods, 20(3), 375-393.
>>> https://doi.org/10.1037/met0000011
>>>
>>> Tanner-Smith, E. E., Tipton, E., & Polanin, J. R. (2016). Handling
>>> complex meta-analytic data structures using robust variance estimates: A
>>> tutorial in R. Journal of Developmental and Life-Course Criminology, 2(1),
>>> 85-112. https://doi.org/10.1007/s40865-016-0026-5
>>>
>>> Here, the cluster-robust Wald-test makes use of a small-sample
>>> correction that should improve its performance when the number of studies
>>> is small. I assume though that also with this correction, there are limits
>>> to how well the test works when the number of studies gets really low.
>>> James or Elizabeth might be in a better position to comment on this.
>>>
>>> An interesting question is whether the degree of discrepancy between the
>>> standard and the cluster-robust Wald test could be used as a rough measure
>>> of the extent to which the working model is reasonable, and if so, how to
>>> quantify that discrepancy. Despite the difference in p-values, the test
>>> statistics are actually quite large for both tests. It's just
>>> that the estimated denominator degrees of freedom (which I believe is based
>>> on a Satterthwaite approximation, which is also an asymptotic method) for
>>> the F-test (1.08) are very small, so that even with F=40.9 (and df=3 in the
>>> numerator), the test ends up being not significant (p=0.0998) -- but that
>>> just narrowly misses being a borderline trend approaching the brink of
>>> statistical significance ... :/
>>>
>>> I personally would say that the tests are actually not *that*
>>> discrepant, although I have a relatively high tolerance for discrepancies
>>> when it comes to such sensitivity analyses (I just know how much fudging is
>>> typically involved when it comes to things like the extraction /
>>> calculation of the effect size estimates themselves, so that discussions
>>> around these more subtle statistical details - which are definitely fun and
>>> help me procrastinate - kind of miss the elephant in the room).
>>>
>>> Best,
>>> Wolfgang
>>>
>>> >-----Original Message-----
>>> >From: R-sig-meta-analysis [mailto:
>>> r-sig-meta-analysis-bounces using r-project.org] On
>>> >Behalf Of Cátia Ferreira De Oliveira
>>> >Sent: Thursday, 02 September, 2021 2:23
>>> >To: R meta
>>> >Subject: [R-meta] Wald_test - is it powerful enough?
>>> >
>>> >Hello,
>>> >
>>> >I hope you are well.
>>> >Is the Wald_test a lot less powerful than the QM test? I ask this
>>> because
>>> >in the example below the QM test is significant but the Wald test is
>>> not,
>>> >shouldn't they be equivalent?
>>> >If it is indeed the case that the Wald_test is not powerful enough to
>>> >detect a difference, is there a good equivalent test more powerful than
>>> the
>>> >Wald test that can be used alongside the robumeta package?
>>> >
>>> >Best wishes,
>>> >
>>> >Catia
>>> >
>>> >dat <- dat.assink2016
>>> >V <- impute_covariance_matrix(dat$vi, cluster=dat$study, r=0.6)
>>> >
>>> ># fit multivariate model with delinquency type as moderator
>>> >
>>> >res <- rma.mv(yi, V, mods = ~ deltype-1, random = ~
>>> >factor(esid) | study, data=dat)
>>> >res
>>> >
>>> >Multivariate Meta-Analysis Model (k = 100; method: REML)
>>> >
>>> >Variance Components:
>>> >
>>> >outer factor: study        (nlvls = 17)
>>> >inner factor: factor(esid) (nlvls = 22)
>>> >
>>> >            estim    sqrt  fixed
>>> >tau^2      0.2150  0.4637     no
>>> >rho        0.3990             no
>>> >
>>> >Test for Residual Heterogeneity:
>>> >QE(df = 97) = 639.0911, p-val < .0001
>>> >
>>> >Test of Moderators (coefficients 1:3):
>>> >QM(df = 3) = 28.0468, p-val < .0001
>>> >
>>> >Model Results:
>>> >
>>> >                estimate      se     zval    pval    ci.lb   ci.ub
>>> >deltypecovert    -0.2902  0.2083  -1.3932  0.1635  -0.6984  0.1180
>>> >deltypegeneral    0.4160  0.0975   4.2688  <.0001   0.2250  0.6070  ***
>>> >deltypeovert      0.1599  0.1605   0.9963  0.3191  -0.1546  0.4743
>>> >
>>> >---
>>> >Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>> >
>>> >Wald_test(res, constraints=constrain_zero(1:3), vcov="CR2",
>>> >cluster=dat$study)
>>> >
>>> >test Fstat df_num df_denom  p_val sig
>>> >  HTZ  40.9      3     1.08 0.0998   .
>>> >
>>> >Thank you,
>>> >
>>> >Catia
>>> _______________________________________________
>>> R-sig-meta-analysis mailing list
>>> R-sig-meta-analysis using r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis
>>>
>>
>
> --
> Cátia Margarida Ferreira de Oliveira
> Psychology PhD Student
> Department of Psychology, Room B214
> University of York, YO10 5DD
> pronouns: she, her
>
