[R-meta] Wald_test - is it powerful enough?
Michael Dewey
lists using dewey.myzen.co.uk
Fri Sep 3 13:00:23 CEST 2021
Dear Cátia
On 03/09/2021 08:00, Cátia Ferreira De Oliveira wrote:
> Dear Wolfgang and James,
>
> Thank you for your responses. I very much appreciate the fact that you both
> spent so much time on the responses!
> Just some follow-up questions, if that's ok. Would you still report the QM
> tests or just the models corrected for misspecification? I also wonder
> whether it would be valuable to refer to the patterns that potentially exist
> in the dataset whilst making a note that they could be a fluke.
I will leave that for James but would suggest that if different methods
lead to different conclusions then reporting both is safest, although
rather awkward.
> Also, a very simple question but I am wondering what would be the best way
> of coding for multiple experiments published within one study, all with
> completely different participants. Would you code it as all belonging to
> the same study (same identifier) with only the specific variables varying
> (e.g. age and other information of relevance) or would you identify it as
> being separate studies e.g. (X, 2012a; X, 2012b...)? I am sorry if this is
> a really basic question!
>
If that happens then you have experiments nested within studies and
measures nested within experiments, so I think you would ideally need to
account for that nesting in the model.
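As a rough sketch only (the column names study, experiment and esid are
hypothetical placeholders for however your own data are coded, and yi and V
stand for your effect sizes and imputed covariance matrix), such nesting can
be expressed in rma.mv with a nested random-effects formula:

library(metafor)

# effect sizes (esid) nested within experiments,
# experiments nested within studies
res_nested <- rma.mv(yi, V,
                     random = ~ 1 | study/experiment/esid,
                     data = dat)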
Michael
> Thank you for your help!
>
> Best wishes,
>
> Catia
>
> On Thu, 2 Sept 2021 at 15:31, James Pustejovsky <jepusto using gmail.com> wrote:
>
>> Catia,
>>
>> I'll add a few observations to Wolfgang's points. I agree that it doesn't
>> really make sense to compare the model-based QM test to the robust HTZ test
>> because they don't necessarily have the same Type I error rates. As
>> Wolfgang noted, the QM test will only maintain Type I error if you've
>> correctly modeled the correlations and random effects structure (which
>> seems implausible given that r = 0.6 is a very rough, arbitrary
>> assumption). A further issue is that the QM test also relies on
>> large-sample approximation (because it uses a chi-squared reference
>> distribution) and so requires a sufficiently large number of studies to
>> provide calibrated Type I error. With a small number of studies, it will
>> tend to have inflated Type I error (yielding p-values that are
>> smaller than they should be). So two strikes against it.
>>
>> The cluster-robust HTZ test provides a level of insurance against model
>> mis-specification. It also uses small-sample adjustments so that it should
>> properly control Type I error rates even when based on a fairly small
>> number of studies. However, it does still entail a degree of approximation
>> and will not necessarily have *exactly* calibrated Type I error rates. In
>> Tipton & Pustejovsky (2015, cited in Wolfgang's response), we found that
>> the HTZ test properly controls Type I error rates, meaning that error rates
>> were always at or below the nominal level. But we also observed that HTZ
>> can be conservative, in the sense that it sometimes has Type I error rates
>> substantially below the nominal level (such as .01 when alpha = .05). This
>> suggests that the test can sometimes have very limited power. It seems
>> you've identified one such situation.
>>
>> As Wolfgang noted, the denominator degrees of freedom of the robust HTZ
>> test are very small here, which indicates a scenario where the HTZ might
>> have problems. For this particular example, I expect that it is because
>> effect sizes of type "covert" occur only in a single study and effects of
>> type "overt" occur only in a few:
>>
>> library(dplyr)
>> dat %>%
>>   group_by(deltype) %>%
>>   summarise(
>>     studies = n_distinct(study),
>>     effects = n()
>>   )
>>
>> # A tibble: 3 x 3
>>   deltype studies effects
>>   <chr>     <int>   <int>
>> 1 covert        1       9
>> 2 general      17      78
>> 3 overt         3      13
>>
>> This is a situation where RVE is not going to work well (if at all)
>> because RVE is based on only between-study variation in effect sizes.
>> Another way to check this is to look at the Satterthwaite degrees of
>> freedom of the individual coefficients you are testing against zero:
>>
>> conf_int(res, vcov = "CR2")
>>
>>             Coef Estimate     SE  d.f. Lower 95% CI Upper 95% CI
>> 1  deltypecovert   -0.290 0.0823  4.20       -0.515      -0.0658
>> 2 deltypegeneral    0.416 0.0997 14.76        0.203       0.6288
>> 3   deltypeovert    0.160 0.0860  2.67       -0.134       0.4539
>>
>> As you can see, the first and the third coefficients have very few degrees
>> of freedom, so the uncertainty around them will be less well quantified.
>>
>> In situations like this, I think it is advisable to limit the tests to
>> coefficients for which at least some minimum number of studies have effect
>> size estimates (e.g., at least 4 studies). Applying that rule here would
>> mean limiting the test to only deltype = "general":
>>
>> Wald_test(res, constraints = constrain_zero(2), vcov = "CR2")
>>
>>  test Fstat df_num df_denom  p_val sig
>>   HTZ  17.4      1     14.8 <0.001 ***
>>
>> or equivalently:
>> coef_test(res, vcov = "CR2", coefs = 2)
>>
>>            Coef. Estimate     SE t-stat d.f. p-val (Satt) Sig.
>> 1 deltypegeneral    0.416 0.0997   4.17 14.8       <0.001  ***
>>
>> James
>>
>>
>> On Thu, Sep 2, 2021 at 5:07 AM Viechtbauer, Wolfgang (SP) <
>> wolfgang.viechtbauer using maastrichtuniversity.nl> wrote:
>>
>>> Dear Cátia,
>>>
>>> A comparison of power is only really appropriate if the two tests would
>>> have the same Type I error rate. I can always create a test that
>>> outperforms all other tests in terms of power by *always* rejecting, but
>>> then my test also has a 100% Type I error rate, so it is useless.
>>>
>>> So, whether the cluster-robust Wald test (i.e., Wald_test()) or the
>>> standard Wald-type Q-M test differ in power is a futile question unless we
>>> know that both tests control the Type I error rate. This is impossible to
>>> say in general - it depends on many factors.
>>>
>>> In this example, you are using an approximate V matrix and fitting a
>>> multilevel model (using the multivariate parameterization). That might be a
>>> reasonable working model, although the V matrix is just a very rough
>>> approximation (one would have to look at the details of all articles to see
>>> what kind of dependencies there are between the estimates within studies)
>>> and r=0.6 might or might not be reasonable.
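>>>
>>> One rough way to probe this, sketched here only as an illustration using
>>> the same data and working model as in your example, is to refit the model
>>> for a range of assumed values of r and see how much the moderator test moves:
>>>
>>> library(metafor)
>>> library(clubSandwich)
>>>
>>> # refit the working model under several assumed within-study correlations
>>> rs <- c(0.2, 0.4, 0.6, 0.8)
>>> sapply(rs, function(r) {
>>>   V_r <- impute_covariance_matrix(dat$vi, cluster = dat$study, r = r)
>>>   fit <- rma.mv(yi, V_r, mods = ~ deltype - 1,
>>>                 random = ~ factor(esid) | study, data = dat)
>>>   fit$QM  # model-based omnibus test statistic for the moderator
>>> })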
>>>
>>> So using cluster-robust inference is a sensible further step as an
>>> additional 'safeguard', although there are 'only' 17 studies.
>>> Cluster-robust inference methods work asymptotically, so as the number of
>>> studies goes to infinity. How 'close to infinity' we have to be before we
>>> can trust the cluster-robust inferences is another difficult question that
>>> is impossible to answer in general. These articles should provide some
>>> discussions around this:
>>>
>>> Tanner-Smith, E. E., & Tipton, E. (2014). Robust variance estimation with
>>> dependent effect sizes: Practical considerations including a software
>>> tutorial in Stata and SPSS. Research Synthesis Methods, 5(1), 13-30.
>>> https://doi.org/10.1002/jrsm.1091
>>>
>>> Tipton, E., & Pustejovsky, J. E. (2015). Small-sample adjustments for
>>> tests of moderators and model fit using robust variance estimation in
>>> meta-regression. Journal of Educational and Behavioral Statistics, 40(6),
>>> 604-634. https://doi.org/10.3102/1076998615606099
>>>
>>> Tipton, E. (2015). Small sample adjustments for robust variance
>>> estimation with meta-regression. Psychological Methods, 20(3), 375-393.
>>> https://doi.org/10.1037/met0000011
>>>
>>> Tanner-Smith, E. E., Tipton, E., & Polanin, J. R. (2016). Handling
>>> complex meta-analytic data structures using robust variance estimates: A
>>> tutorial in R. Journal of Developmental and Life-Course Criminology, 2(1),
>>> 85-112. https://doi.org/10.1007/s40865-016-0026-5
>>>
>>> Here, the cluster-robust Wald-test makes use of a small-sample correction
>>> that should improve its performance when the number of studies is small. I
>>> assume though that also with this correction, there are limits to how well
>>> the test works when the number of studies gets really low. James or
>>> Elizabeth might be in a better position to comment on this.
>>>
>>> An interesting question is whether the degree of discrepancy between the
>>> standard and the cluster-robust Wald-test could be used as a rough measure
>>> of the extent to which the working model is reasonable, and if so, how to
>>> quantify the degree of the discrepancy. Despite the difference in p-values,
>>> the test statistics are actually quite large for both tests. It's just
>>> that the estimated denominator degrees of freedom (which I believe is based
>>> on a Satterthwaite approximation, which is also an asymptotic method) for
>>> the F-test (1.08) are very small, so that even with F=40.9 (and df=3 in the
>>> numerator), the test ends up being not significant (p=0.0998) -- but that
>>> just narrowly misses being a borderline trend approaching the brink of
>>> statistical significance ... :/
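>>>
>>> As an informal check along those lines (again just a sketch, using the
>>> fitted model from your example), one can also put the model-based and
>>> cluster-robust (CR2) standard errors of the coefficients side by side:
>>>
>>> library(clubSandwich)
>>>
>>> # model-based vs. cluster-robust (CR2) standard errors of the fixed effects
>>> cbind(model_SE  = sqrt(diag(vcov(res))),
>>>       robust_SE = sqrt(diag(vcovCR(res, cluster = dat$study, type = "CR2"))))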
>>>
>>> I personally would say that the tests are actually not *that* discrepant,
>>> although I have a relatively high tolerance for discrepancies when it comes
>>> to such sensitivity analyses (I just know how much fudging is typically
>>> involved when it comes to things like the extraction / calculation of the
>>> effect size estimates themselves, so that discussions around these more
>>> subtle statistical details - which are definitely fun and help me
>>> procrastinate - kind of miss the elephant in the room).
>>>
>>> Best,
>>> Wolfgang
>>>
>>>> -----Original Message-----
>>>> From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces using r-project.org] On
>>>> Behalf Of Cátia Ferreira De Oliveira
>>>> Sent: Thursday, 02 September, 2021 2:23
>>>> To: R meta
>>>> Subject: [R-meta] Wald_test - is it powerful enough?
>>>>
>>>> Hello,
>>>>
>>>> I hope you are well.
>>>> Is the Wald_test a lot less powerful than the QM test? I ask this because
>>>> in the example below the QM test is significant but the Wald test is not.
>>>> Shouldn't they be equivalent?
>>>> If it is indeed the case that the Wald_test is not powerful enough to
>>>> detect a difference, is there a good equivalent test more powerful than
>>> the
>>>> Wald test that can be used alongside the robumeta package?
>>>>
>>>> Best wishes,
>>>>
>>>> Catia
>>>>
>>>> dat <- dat.assink2016
>>>> V <- impute_covariance_matrix(dat$vi, cluster=dat$study, r=0.6)
>>>>
>>>> # fit multivariate model with delinquency type as moderator
>>>> res <- rma.mv(yi, V, mods = ~ deltype-1,
>>>>               random = ~ factor(esid) | study, data=dat)
>>>> res
>>>>
>>>> Multivariate Meta-Analysis Model (k = 100; method: REML)
>>>>
>>>> Variance Components:
>>>>
>>>> outer factor: study        (nlvls = 17)
>>>> inner factor: factor(esid) (nlvls = 22)
>>>>
>>>>             estim    sqrt  fixed
>>>> tau^2      0.2150  0.4637     no
>>>> rho        0.3990             no
>>>>
>>>> Test for Residual Heterogeneity:
>>>> QE(df = 97) = 639.0911, p-val < .0001
>>>>
>>>> Test of Moderators (coefficients 1:3):
>>>> QM(df = 3) = 28.0468, p-val < .0001
>>>>
>>>> Model Results:
>>>>
>>>>                 estimate      se     zval    pval    ci.lb   ci.ub
>>>> deltypecovert    -0.2902  0.2083  -1.3932  0.1635  -0.6984  0.1180
>>>> deltypegeneral    0.4160  0.0975   4.2688  <.0001   0.2250  0.6070  ***
>>>> deltypeovert      0.1599  0.1605   0.9963  0.3191  -0.1546  0.4743
>>>>
>>>> ---
>>>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>>
>>>> Wald_test(res, constraints=constrain_zero(1:3), vcov="CR2",
>>>>           cluster=dat$study)
>>>>
>>>>  test Fstat df_num df_denom  p_val sig
>>>>   HTZ  40.9      3     1.08 0.0998   .
>>>>
>>>> Thank you,
>>>>
>>>> Catia
>>> _______________________________________________
>>> R-sig-meta-analysis mailing list
>>> R-sig-meta-analysis using r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis
>>>
>>
>
--
Michael
http://www.dewey.myzen.co.uk/home.html