[Rd] Inconsistencies in wilcox.test

Martin Maechler m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Sat Dec 14 22:50:07 CET 2019


>>>>> Martin Maechler 
>>>>>     on Thu, 12 Dec 2019 17:20:47 +0100 writes:

>>>>> Karolis Koncevičius 
>>>>>     on Mon, 9 Dec 2019 23:43:36 +0200 writes:

    >> So I tried adding Infinity support for all cases.  And it
    >> is (as could be expected) more complicated than I
    >> thought.

    > "Of course !"  Thank you, Karolis, in any case!

    >> It is easy to add Inf support for the test. The problems
    >> start with conf.int=TRUE.

    >> Currently confidence intervals are computed via
    >> `uniroot()` and, in the case of infinities, we are
    >> computationally looking for roots over infinite interval
    >> which results in an error. I suspect this is the reason
    >> Inf values were removed in the first place.

    > Maybe. It's still wrong to be done "up front".  I'm sure
    > 98% (or so ;-) of all calls to wilcox.test() do *not* use
    > conf.int = TRUE


    >> Just a note, I found a few more errors/inconsistencies
    >> when requesting confidence intervals with paired=TRUE
    >> (due to Infinities being left in).

    >> Current error in Inf-Inf scenario:

    >> wilcox.test(c(1,2,Inf), c(4,8,Inf), paired=TRUE,
    >> conf.int=TRUE) Error in if (ZEROES) x <- x[x != 0] :
    >> missing value where TRUE/FALSE needed

    > Good catch .. notably as it also happens when
    > conf.int=FALSE as by default.  My version of wilcox.test()
    > now does give the same as when the to 'Inf' are left away.

    >> NaN confidence intervals:

    >> wilcox.test(c(1:9,Inf), c(21:28,Inf,30), paired=TRUE,
    >> conf.int=TRUE)

    >> Wilcoxon signed rank test with continuity correction

    >> data: c(1:9, Inf) and c(21:28, Inf, 30) V = 9.5, p-value
    >> = 0.0586 alternative hypothesis: true location shift is
    >> not equal to 0 0 percent confidence interval: NaN NaN
    >> sample estimates: midrange NaN

    > I don't see a big problem here. The NaN's in some sense
    > show the best that can be computed with this data.
    > Statistic and P-value, but no conf.int.


    >> The easiest "fix" for consistency would be to simply
    >> remove Infinity support from the paired=TRUE case.

    > I strongly disagree.  We are not pruning good
    > functionality just for some definition of consistency.

    >> But going with the more desirable approach of adding
    >> Infinity support for non-paired cases - it is currently
    >> not clear to me what confidence intervals and
    >> pseudomedian should be. Specially when Infinities are on
    >> both sides.

    > I deem that not to be a big deal.  They are not defined
    > given the default formulas and that is reflected by NA /
    > NaN in those parts of the result.

    >> Regards, Karolis Koncevičius.

    > But I have also spent a few hours now on
    > wilcox.test.default() behavior notably also looking at the
    > "rounding" / "machine precision" situation, and also on
    > your remark that the 'method: ...' does not indicate well
    > enough what was computed.

    > In my (not yet committed) but hereby proposed enhancement
    > of wilcox.test(), I have a new argument, 'digits.rank =
    > Inf' (the default 'Inf' corresponding to the current
    > behavior) with help page documentation:

    > digits.rank: a number; if finite, ‘rank(signif(r,
    > digits.rank))’ will be used to compute ranks for the test
    > statistic instead of (the default) ‘rank(r)’.

    > and then in 'Details :'

    >      For stability reasons, it may be advisable to use
    > rounded data or to set ‘digits.rank = 7’, say, such that
    > determination of ties does not depend on very small
    > numeric differences (see the example).

    > and then in 'Examples: '

    >      ## accuracy in ties determination via 'digits.rank':
    > wilcox.test( 4:2, 3:1, paired=TRUE) # Warning: cannot
    > compute exact p-value with ties wilcox.test((4:2)/10,
    > (3:1)/10, paired=TRUE) # no ties => *no* warning
    > wilcox.test((4:2)/10, (3:1)/10, paired=TRUE, digits.rank =
    > 9) # same ties as (4:2, 3:1)
     
    > ----------------------

    > Lastly, I propose to replace "test" by "exact test" in the
    > 'method' component (and print out) in case exact
    > computations were used.  This information should be part
    > of the returned "htest" object, and not only visible from
    > the arguments and warnings that are printed during the
    > computations.  This last change is in some sense the "most
    > back-incompatible" change of these, because many
    > wilcox.test() printouts would slightly change, e.g.,

    >> w0 <- wilcox.test( 1:5, 4*(0:4), paired=TRUE)

    > 	  Wilcoxon signed rank exact test

    >   data: 1:5 and 4 * (0:4) V = 1, p-value = 0.125
    > alternative hypothesis: true location shift is not equal
    > to 0

    > where before (in R <= 3.6.x) it is just

    > 	  Wilcoxon signed rank test

    >   data: .........  ...............  ...............

    > but I think R 4.0.0 is a good occasion for such a change.

    > Constructive feedback on all this is very welcome!  Martin

... none  ...  I "assume" this means everybody likes the idea ;-)

Anyway, now comitted to R-devel (for R 4.0.0), svn rev 77569
(in 'NEW FEATURES').

Martin 



    >> On 2019-12-07 23:18, Karolis Koncevičius wrote:
    >>> Thank you for a fast response. Nice to see this mailing
    >>> list being so alive.
    >>> 
    >>> Regarding Inf issue: I agree with your assessment that
    >>> Inf should not be removed. The code gave me an
    >>> impression that Inf values were intentionally removed
    >>> (since is.finite() was used everywhere, except for
    >>> paired case). I will try to adjust my patch according to
    >>> your feedback.
    >>> 
    >>> One more thing: it seems like you assumed that issues
    >>> 2:4 are all related to machine precision, which is not
    >>> the case - only 2nd issue is.  Just wanted to draw this
    >>> to your attention, in case you might have some feedback
    >>> and guidelines about issues 3 and 4 as well.
    >>> 
    >>> 
    >>> 
    >>> On 2019-12-07 21:59, Martin Maechler wrote:
    >>>>>>>>> Karolis Koncevičius on Sat, 7 Dec 2019 20:55:36
    >>>>>>>>> +0200 writes:
    >>>> 
    >>>> > Hello, > Writing to share some things I've found
    >>>> about wilcox.test() that seem a > a bit inconsistent.
    >>>> 
    >>>> > 1. Inf values are not removed if paired=TRUE
    >>>> 
    >>>> > # returns different results (Inf is removed): >
    >>>> wilcox.test(c(1,2,3,4), c(0,9,8,7)) >
    >>>> wilcox.test(c(1,2,3,4), c(0,9,8,Inf))
    >>>> 
    >>>> > # returns the same result (Inf is left as value with
    >>>> highest rank): > wilcox.test(c(1,2,3,4), c(0,9,8,7),
    >>>> paired=TRUE) > wilcox.test(c(1,2,3,4), c(0,9,8,Inf),
    >>>> paired=TRUE)
    >>>> 
    >>>> Now which of the two cases do you consider correct ?
    >>>> 
    >>>> IHMO, the 2nd one is correct: it is exactly one
    >>>> property of the *robustness* of the wilcoxon test and
    >>>> very desirable that any (positive) outlier is treated
    >>>> the same as just "the largest value" and the test
    >>>> statistic (and hence the p-value) should not change.
    >>>> 
    >>>> So I think the first case is wrong, notably if
    >>>> modified, (such that the last y is the overall maximal
    >>>> value (slightly larger sample):
    >>>> 
    >>>>> wilcox.test(1:7, 1/8+ c(9:4, 12))
    >>>> 
    >>>> Wilcoxon rank sum test
    >>>> 
    >>>> data: 1:7 and 1/8 + c(9:4, 12) W = 6, p-value = 0.01748
    >>>> alternative hypothesis: true location shift is not
    >>>> equal to 0
    >>>> 
    >>>>> wilcox.test(1:7, 1/8+ c(9:4, 10000))
    >>>> 
    >>>> Wilcoxon rank sum test
    >>>> 
    >>>> data: 1:7 and 1/8 + c(9:4, 10000) W = 6, p-value =
    >>>> 0.01748 alternative hypothesis: true location shift is
    >>>> not equal to 0
    >>>> 
    >>>>> wilcox.test(1:7, 1/8+ c(9:4, Inf))
    >>>> 
    >>>> Wilcoxon rank sum test
    >>>> 
    >>>> data: 1:7 and 1/8 + c(9:4, Inf) W = 6, p-value =
    >>>> 0.03497 alternative hypothesis: true location shift is
    >>>> not equal to 0
    >>>> 
    >>>> The Inf case should definitely give the same as the
    >>>> 10'000 case.  That's exactly one property of a robust
    >>>> statistic.
    >>>> 
    >>>> Thank you, Karolis, this is pretty embarrassing to only
    >>>> be detected now after 25+ years of R in use ...
    >>>> 
    >>>> The correct fix starts with replacing the is.finite()
    >>>> by !is.na() and keep the 'Inf' in the rank
    >>>> computations...  (but then probably also deal with the
    >>>> case of more than one Inf, notably the Inf - Inf
    >>>> "exception" which is not triggered by your example...)
    >>>> 
    >>>> 
    >>>> ---
    >>>> 
    >>>> Ben addressed the "rounding" / numerical issues
    >>>> unavoidable for the other problems.
    >>>> 
    >>>> > 2. tolerance issues with paired=TRUE.
    >>>> 
    >>>> > wilcox.test(c(4, 3, 2), c(3, 2, 1), paired=TRUE) > #
    >>>> ...  > # Warning: cannot compute exact p-value with
    >>>> ties
    >>>> 
    >>>> > wilcox.test(c(0.4,0.3,0.2), c(0.3,0.2,0.1),
    >>>> paired=TRUE) > # ...  > # no warning
    >>>> 
    >>>> > 3. Always 'x observations are missing' when
    >>>> paired=TRUE
    >>>> 
    >>>> > wilcox.test(c(1,2), c(NA_integer_,NA_integer_),
    >>>> paired=TRUE) > # ...  > # Error: not enough (finite)
    >>>> 'x' observations
    >>>> 
    >>>> > 4. No indication if normal approximation was used:
    >>>> 
    >>>> > # different numbers, but same "method" name >
    >>>> wilcox.test(rnorm(10), exact=FALSE, correct=FALSE) >
    >>>> wilcox.test(rnorm(10), exact=TRUE, correct=FALSE)
    >>>> 
    >>>> 
    >>>> > From all of these I am pretty sure the 1st one is
    >>>> likely unintended, > so attaching a small patch to
    >>>> adjust it. Can also try patching others if > consensus
    >>>> is reached that the behavioiur has to be modified.
    >>>> 
    >>>> > Kind regards, > Karolis Koncevičius.
    >>>>



More information about the R-devel mailing list