[R-meta] Wald_test - is it powerful enough?

Cátia Ferreira De Oliveira cmfo500 at york.ac.uk
Wed Sep 22 09:00:00 CEST 2021


Dear James and Wolfgang,

Would you say that RVE is more penalising of estimates when we have small
samples?
I wonder if that could also explain the disparity between the Wald tests and
rma.mv, since for the rest of the analyses, where I have larger sample sizes,
the results seem to converge a lot more.

Best wishes,

Catia

On Fri, 3 Sept 2021 at 14:39, James Pustejovsky <jepusto using gmail.com> wrote:

> Hi Catia,
>
> Responses below.
>
> James
>
> On Fri, Sep 3, 2021 at 2:00 AM Cátia Ferreira De Oliveira <
> cmfo500 using york.ac.uk> wrote:
>
>> Dear Wolfgang and James,
>>
>> Thank you for your responses. I very much appreciate the fact that you
>> both spent so much time on them!
>> Just some follow-up questions, if that's ok. Would you still report the
>> QM tests, or only the models corrected for misspecification? I wonder
>> whether it would be valuable to refer to the patterns that potentially
>> exist in the dataset while noting that they could just be a fluke.
>>
>
> I agree with Michael that it's good to report sensitivity analyses for
> reasonable alternative approaches to analyzing the data. Here, however, I
> think it depends a bit on what your overall analytic approach is. If the
> rest of your analysis is all based on RVE inference, then I don't think it
> would make as much sense to report the model-based QM test results just for
> one particular hypothesis.
>
>
>> Also, a very simple question, but I am wondering about the best way to
>> code multiple experiments published within one study, all with completely
>> different participants. Would you code them as all belonging to the same
>> study (same identifier), with only the specific variables varying (e.g.,
>> age and other information of relevance), or would you identify them as
>> separate studies, e.g. (X, 2012a; X, 2012b...)? I am sorry if this is a
>> really basic question!
>>
>
> As Michael noted, this can be understood as an additional level of
> nesting. One approach to handling this is to include an additional level of
> random effects in the (working) model: random effects for each unique study
> ID, random effects for each unique experiment ID (nested within study), and
> possibly random effects for each unique outcome/effect size. Depending on
> how many studies include multiple independent experiments, it might not be
> possible to estimate this model very well and you'd have to simplify (e.g.,
> by dropping the middle level of random effects per experiment).
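>
> In metafor syntax, and reusing the variable names from your example (with
> `experiment` as a placeholder for however the experiment identifier ends up
> being coded), such a working model could be sketched roughly as:
>
> library(metafor)
> # three levels of random effects: study, experiment nested within study,
> # and individual effect sizes nested within experiment
> res <- rma.mv(yi, V, mods = ~ deltype - 1,
>               random = ~ 1 | study/experiment/esid, data = dat)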
>
> One thing to note with this set-up: when creating the V-matrix, you should
> use the experiment ID as the clustering variable because the sampling
> errors of the effect size estimates are only correlated if they're based on
> the same (or overlapping) sample of participants. But then when using RVE
> for SEs/hypothesis tests/CIs, you should cluster on the top-level study
> IDs. clubSandwich will cluster by the top level of random effects by
> default, so you can do this just by omitting the cluster = argument when
> using vcovCR(), coef_test(), Wald_test(), or conf_int().
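>
> A rough sketch of that workflow (again treating `experiment` as a placeholder
> column, and noting that the experiment ID passed to the V-matrix function has
> to be unique across the whole dataset):
>
> library(metafor)
> library(clubSandwich)
> # globally unique experiment ID
> dat$exp_id <- interaction(dat$study, dat$experiment, drop = TRUE)
> # V matrix clustered by experiment: sampling errors are only correlated
> # within the same (or an overlapping) sample of participants
> V <- impute_covariance_matrix(dat$vi, cluster = dat$exp_id, r = 0.6)
> res <- rma.mv(yi, V, mods = ~ deltype - 1,
>               random = ~ 1 | study/exp_id/esid, data = dat)
> # RVE clustered at the study level: clubSandwich clusters by the top level
> # of random effects by default, so the cluster = argument can be omitted
> Wald_test(res, constraints = constrain_zero(1:3), vcov = "CR2")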
>
>
>
>>
>> Thank you for your help!
>>
>> Best wishes,
>>
>> Catia
>>
>> On Thu, 2 Sept 2021 at 15:31, James Pustejovsky <jepusto using gmail.com>
>> wrote:
>>
>>> Catia,
>>>
>>> I'll add a few observations to Wolfgang's points. I agree that it
>>> doesn't really make sense to compare the model-based QM test to the robust
>>> HTZ test because they don't necessarily have the same Type I error rates.
>>> As Wolfgang noted, the QM test will only maintain Type I error if you've
>>> correctly modeled the correlations and random effects structure (which
>>> seems implausible given that r = 0.6 is a very rough, arbitrary
>>> assumption). A further issue is that the QM test also relies on
>>> large-sample approximation (because it uses a chi-squared reference
>>> distribution) and so requires a sufficiently large number of studies to
>>> provide calibrated Type I error. With a small number of studies, it will
>>> tend to have an inflated Type I error rate (yielding p-values that are
>>> smaller than they should be). So two strikes against it.
>>>
>>> The cluster-robust HTZ test provides a level of insurance against model
>>> mis-specification. It also uses small-sample adjustments so that it should
>>> properly control Type I error rates even when based on a fairly small
>>> number of studies. However, it does still entail a degree of approximation
>>> and will not necessarily have *exactly* calibrated Type I error rates. In
>>> Tipton & Pustejovsky (2015, cited in Wolfgang's response), we found that
>>> the HTZ test properly controls Type I error rates, meaning that error rates
>>> were always at or below the nominal level. But we also observed that HTZ
>>> can be conservative, in the sense that it sometimes has Type I error rates
>>> substantially below the nominal level (such as .01 when alpha = .05). This
>>> suggests that the test can sometimes have very limited power. It seems
>>> you've identified one such situation.
>>>
>>> As Wolfgang noted, the denominator degrees of freedom of the robust HTZ
>>> test are very small here, which indicates a scenario where the HTZ might
>>> have problems. For this particular example, I expect that it is because
>>> effect sizes of type "covert" occur only in a single study and effects of
>>> type "overt" occur only in a few:
>>>
>>> library(dplyr)
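>>> # dat here is dat.assink2016, as in the original post quoted below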
>>> dat %>%
>>>   group_by(deltype) %>%
>>>   summarise(
>>>     studies = n_distinct(study),
>>>     effects = n()
>>>   )
>>>
>>> # A tibble: 3 x 3
>>>   deltype studies effects
>>>   <chr>     <int>   <int>
>>> 1 covert        1       9
>>> 2 general      17      78
>>> 3 overt         3      13
>>>
>>> This is a situation where RVE is not going to work well (if at all)
>>> because RVE is based on only between-study variation in effect sizes.
>>> Another way to check this is to look at the Satterthwaite degrees of
>>> freedom of the individual coefficients you are testing against zero:
>>>
>>> conf_int(res, vcov = "CR2")
>>>
>>>             Coef Estimate     SE  d.f. Lower 95% CI Upper 95% CI
>>> 1  deltypecovert   -0.290 0.0823  4.20       -0.515      -0.0658
>>> 2 deltypegeneral    0.416 0.0997 14.76        0.203       0.6288
>>> 3   deltypeovert    0.160 0.0860  2.67       -0.134       0.4539
>>>
>>> As you can see, the first and the third coefficients have very few
>>> degrees of freedom, so the uncertainty around them will be less well
>>> quantified.
>>>
>>> In situations like this, I think it is advisable to limit the tests to
>>> coefficients for which at least some minimum number of studies have effect
>>> size estimates (e.g., at least 4 studies). Applying that rule here would
>>> mean limiting the test to only deltype = "general":
>>>
>>> Wald_test(res, constraints = constrain_zero(2), vcov = "CR2")
>>>
>>> test Fstat df_num df_denom  p_val sig
>>>   HTZ  17.4      1     14.8 <0.001 ***
>>>
>>> or equivalently:
>>> coef_test(res, vcov = "CR2", coefs = 2)
>>>
>>>            Coef. Estimate     SE t-stat d.f. p-val (Satt) Sig.
>>> 1 deltypegeneral    0.416 0.0997   4.17 14.8       <0.001  ***
>>>
>>> James
>>>
>>>
>>> On Thu, Sep 2, 2021 at 5:07 AM Viechtbauer, Wolfgang (SP) <
>>> wolfgang.viechtbauer using maastrichtuniversity.nl> wrote:
>>>
>>>> Dear Cátia,
>>>>
>>>> A comparison of power is only really appropriate if the two tests would
>>>> have the same Type I error rate. I can always create a test that
>>>> outperforms all other tests in terms of power by *always* rejecting, but
>>>> then my test also has a 100% Type I error rate, so it is useless.
>>>>
>>>> So, whether the cluster-robust Wald test (i.e., Wald_test()) and the
>>>> standard Wald-type QM test differ in power is a futile question unless we
>>>> know that both tests control the Type I error rate. This is impossible to
>>>> say in general - it depends on many factors.
>>>>
>>>> In this example, you are using an approximate V matrix and fitting a
>>>> multilevel model (using the multivariate parameterization). That might be a
>>>> reasonable working model, although the V matrix is just a very rough
>>>> approximation (one would have to look at the details of all articles to see
>>>> what kind of dependencies there are between the estimates within studies)
>>>> and r=0.6 might or might not be reasonable.
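>>>>
>>>> (As a quick sketch, one way to get a feel for how much the choice of r
>>>> matters is to simply refit the model from your example for a few plausible
>>>> values of r and compare the cluster-robust results:
>>>>
>>>> for (r in c(0.4, 0.6, 0.8)) {
>>>>   V_r <- impute_covariance_matrix(dat$vi, cluster = dat$study, r = r)
>>>>   fit <- rma.mv(yi, V_r, mods = ~ deltype - 1,
>>>>                 random = ~ factor(esid) | study, data = dat)
>>>>   print(Wald_test(fit, constraints = constrain_zero(1:3), vcov = "CR2"))
>>>> }
>>>>
>>>> If the conclusions are stable across a range of r values, the r = 0.6
>>>> assumption is less of a concern.)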
>>>>
>>>> So using cluster-robust inference is a sensible further step as an
>>>> additional 'safeguard', although there are 'only' 17 studies.
>>>> Cluster-robust inference methods work asymptotically, that is, as the
>>>> number of studies goes to infinity. How 'close to infinity' we have to be
>>>> before we can trust the cluster-robust inferences is another difficult
>>>> question that is impossible to answer in general. These articles should
>>>> provide some discussion around this:
>>>>
>>>> Tanner-Smith, E. E., & Tipton, E. (2014). Robust variance estimation
>>>> with dependent effect sizes: Practical considerations including a software
>>>> tutorial in Stata and SPSS. Research Synthesis Methods, 5(1), 13-30.
>>>> https://doi.org/10.1002/jrsm.1091
>>>>
>>>> Tipton, E., & Pustejovsky, J. E. (2015). Small-sample adjustments for
>>>> tests of moderators and model fit using robust variance estimation in
>>>> meta-regression. Journal of Educational and Behavioral Statistics, 40(6),
>>>> 604-634. https://doi.org/10.3102/1076998615606099
>>>>
>>>> Tipton, E. (2015). Small sample adjustments for robust variance
>>>> estimation with meta-regression. Psychological Methods, 20(3), 375-393.
>>>> https://doi.org/10.1037/met0000011
>>>>
>>>> Tanner-Smith, E. E., Tipton, E., & Polanin, J. R. (2016). Handling
>>>> complex meta-analytic data structures using robust variance estimates: A
>>>> tutorial in R. Journal of Developmental and Life-Course Criminology, 2(1),
>>>> 85-112. https://doi.org/10.1007/s40865-016-0026-5
>>>>
>>>> Here, the cluster-robust Wald-test makes use of a small-sample
>>>> correction that should improve its performance when the number of studies
>>>> is small. I assume though that also with this correction, there are limits
>>>> to how well the test works when the number of studies gets really low.
>>>> James or Elizabeth might be in a better position to comment on this.
>>>>
>>>> An interesting question is whether the degree of discrepancy between
>>>> the standard and the cluster-robust Wald-test could be used as a rough
>>>> measure of the extent to which the working model is reasonable, and if so,
>>>> how to quantify the degree of the discrepancy. Despite the difference in
>>>> p-values, the test statistics are actually quite large for both tests.
>>>> It's just that the estimated denominator degrees of freedom (which I
>>>> believe is based on a Satterthwaite approximation, which is also an
>>>> asymptotic method) for the F-test (1.08) are very small, so that even with
>>>> F=40.9 (and df=3 in the numerator), the test ends up being not significant
>>>> (p=0.0998) -- but that just narrowly misses being a borderline trend
>>>> approaching the brink of statistical significance ... :/
>>>>
>>>> I personally would say that the tests are actually not *that*
>>>> discrepant, although I have a relatively high tolerance for discrepancies
>>>> when it comes to such sensitivity analyses (I just know how much fudging is
>>>> typically involved when it comes to things like the extraction /
>>>> calculation of the effect size estimates themselves, so that discussions
>>>> around these more subtle statistical details - which are definitely fun and
>>>> help me procrastinate - kind of miss the elephant in the room).
>>>>
>>>> Best,
>>>> Wolfgang
>>>>
>>>> >-----Original Message-----
>>>> >From: R-sig-meta-analysis [mailto:
>>>> r-sig-meta-analysis-bounces using r-project.org] On
>>>> >Behalf Of Cátia Ferreira De Oliveira
>>>> >Sent: Thursday, 02 September, 2021 2:23
>>>> >To: R meta
>>>> >Subject: [R-meta] Wald_test - is it powerful enough?
>>>> >
>>>> >Hello,
>>>> >
>>>> >I hope you are well.
>>>> >Is the Wald_test a lot less powerful than the QM test? I ask this
>>>> >because in the example below the QM test is significant but the Wald
>>>> >test is not; shouldn't they be equivalent?
>>>> >If it is indeed the case that the Wald_test is not powerful enough to
>>>> >detect a difference, is there a good equivalent test, more powerful than
>>>> >the Wald test, that can be used alongside the robumeta package?
>>>> >
>>>> >Best wishes,
>>>> >
>>>> >Catia
>>>> >
>>>> >dat <- dat.assink2016
>>>> >V <- impute_covariance_matrix(dat$vi, cluster=dat$study, r=0.6)
>>>> >
>>>> ># fit multivariate model with delinquency type as moderator
>>>> >
>>>> >res <- rma.mv(yi, V, mods = ~ deltype-1, random = ~ factor(esid) | study,
>>>> >              data=dat)
>>>> >res
>>>> >
>>>> >Multivariate Meta-Analysis Model (k = 100; method: REML)
>>>> >
>>>> >Variance Components:
>>>> >
>>>> >outer factor: study        (nlvls = 17)
>>>> >inner factor: factor(esid) (nlvls = 22)
>>>> >
>>>> >            estim    sqrt  fixed
>>>> >tau^2      0.2150  0.4637     no
>>>> >rho        0.3990             no
>>>> >
>>>> >Test for Residual Heterogeneity:
>>>> >QE(df = 97) = 639.0911, p-val < .0001
>>>> >
>>>> >Test of Moderators (coefficients 1:3):
>>>> >QM(df = 3) = 28.0468, p-val < .0001
>>>> >
>>>> >Model Results:
>>>> >
>>>> >                estimate      se     zval    pval    ci.lb   ci.ub
>>>> >deltypecovert    -0.2902  0.2083  -1.3932  0.1635  -0.6984  0.1180
>>>> >deltypegeneral    0.4160  0.0975   4.2688  <.0001   0.2250  0.6070  ***
>>>> >deltypeovert      0.1599  0.1605   0.9963  0.3191  -0.1546  0.4743
>>>> >
>>>> >---
>>>> >Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>> >
>>>> >Wald_test(res, constraints=constrain_zero(1:3), vcov="CR2",
>>>> >          cluster=dat$study)
>>>> >
>>>> >test Fstat df_num df_denom  p_val sig
>>>> >  HTZ  40.9      3     1.08 0.0998   .
>>>> >
>>>> >Thank you,
>>>> >
>>>> >Catia
>>>> _______________________________________________
>>>> R-sig-meta-analysis mailing list
>>>> R-sig-meta-analysis using r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis
>>>>
>>>
>>
>> --
>> Cátia Margarida Ferreira de Oliveira
>> Psychology PhD Student
>> Department of Psychology, Room B214
>> University of York, YO10 5DD
>> pronouns: she, her
>>
>

-- 
Cátia Margarida Ferreira de Oliveira
Psychology PhD Student
Department of Psychology, Room A105
University of York, YO10 5DD
Twitter: @CatiaMOliveira
pronouns: she, her
