[R] Problem with "Cannot compute correct p-values with ties"

(Ted Harding) Ted.Harding at manchester.ac.uk
Wed Dec 23 19:14:05 CET 2009


The issue of ties in the Mann-Whitney test needs some thought.
The distribution function of the Mann-Whitney test is derived
on the assumption that (in effect) the data are continuous
variables so that (theoretically) there should be no ties.
When ties occur, this assumption has failed.

1. If the data represent a continuous underlying variable which
has been recorded to a relatively coarse precision ("binned"),
so that some ties are likely, then random "tie-breaking" is
a plausible solution. In effect, a tie-cluster of size k would
then represent k unequal observations which could have been in
any one of k! orders (indistinguishable from the data in hand).

On the assumption that the "bin" width is so small that their
possible distinct unobserved values cannot be so different
that any difference would materially affect the probabilities
of different orderings (i.e. they can be considered as if they
were uniformly distributed over the "bin"), then these k! orders
can be considered equally likely. Then the "jittering" (adding
small independent noise values to each of the equal data) will
yield one of these k! orderings with the same probability for
each. Then, when all the tie-clusters have been broken in this
way, the P-value for the Mann-Whitney will be exactly correct
for that particular breaking of the ties.

However, this is just one of the k! possible orderings; similarly
for any other tie-clusters in the data.

A different "jittering" would yield a different ordering, and
a different P-value. So what to choose? Well, you have to recognise
that they are all possible as far as the data tell, and all equally
likely. So an appropriate approach is to simulate a lot of random
tie-breaks, getting a P-value for each, and ending up with an
adequately large sample of random P-values.
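
As a rough sketch of such a simulation (not a definitive recipe): the
data, the half bin width h and the helper one_pvalue() below are all
made up for illustration. Each value is spread uniformly over its
assumed bin, which breaks ties within and between the two samples
without changing the order of values in different bins, and the exact
test is then run on the now-distinct values.

```r
# Hypothetical data recorded to the nearest integer, so ties occur.
set.seed(1)
x <- c(3, 4, 4, 5, 6, 6, 7, 8)
y <- c(5, 6, 6, 7, 8, 8, 9, 10)
h <- 0.5   # assumed half bin width (data rounded to the nearest 1)

# One random tie-break: spread each value uniformly over its bin,
# then compute the exact Mann-Whitney P-value for the jittered data.
one_pvalue <- function(x, y, h) {
  xj <- x + runif(length(x), -h, h)
  yj <- y + runif(length(y), -h, h)
  wilcox.test(xj, yj, exact = TRUE)$p.value
}

# A sample of P-values, one per random tie-break.
pvals <- replicate(2000, one_pvalue(x, y, h))
summary(pvals)
```

runif() is used here rather than jitter() only so that the noise range
is tied explicitly to the assumed bin width.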

What you do with this depends on what you need, and on how they
are distributed. If, for instance, all of 10000 P-values were
less than, say, 0.001, and 0.001 was an adequate P-value for
your purposes, then you can be very confident that you have a
"significant result" -- in other words, if you had known the
exact underlying values (with no ties), then it is almost certain
that you would still have got a P < 0.001 test result.

Similarly, maybe 2 out of 10000 are greater than 0.01, the rest less.
Then you can be fairly confident that the "true" P-value is less
than 0.01.

Or you could estimate the "true" P-value as the mean of the simulated
ones (preferably with a Standard Error too). Indeed, you could simply
compute a confidence interval for the P-value (but you would have
to choose the confidence level).
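
A minimal sketch of these summaries, again with made-up data and an
assumed half bin width h. The interval shown is a normal-approximation
interval for the mean of the simulated P-values, which is only one of
several reasonable choices; the 95% level and the 0.05 threshold are
arbitrary choices for the example.

```r
set.seed(2)
x <- c(3, 4, 4, 5, 6, 6, 7, 8)    # hypothetical binned data
y <- c(5, 6, 6, 7, 8, 8, 9, 10)
h <- 0.5                           # assumed half bin width

# P-values from 2000 random tie-breaks (uniform noise within each bin).
pvals <- replicate(2000, {
  wilcox.test(x + runif(length(x), -h, h),
              y + runif(length(y), -h, h), exact = TRUE)$p.value
})

p_mean <- mean(pvals)                       # estimate of the "true" P-value
p_se   <- sd(pvals) / sqrt(length(pvals))   # Monte Carlo standard error of the mean
p_ci   <- p_mean + c(-1, 1) * qnorm(0.975) * p_se   # approximate 95% interval
frac_above <- mean(pvals > 0.05)            # share of tie-breaks giving P > 0.05

c(mean = p_mean, se = p_se, lower = p_ci[1], upper = p_ci[2],
  frac_above = frac_above)
```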

If there are only a few ties in the data, then a complete enumeration
of all possible tie-breaks is feasible. You then have everything you
could possibly need to know, given the data, relative to this approach.

2. If the data represent essentially discrete values (e.g. they
are count data, or ordered categorical), where ties are intrinsically
possible, then strictly speaking the Mann-Whitney test is not
appropriate, since its distribution function depends on the assumption
of continuity which is not true here.

However, nothing prevents you adopting the Mann-Whitney statistic
as your test statistic of choice. The only problem is that you
may not refer its value to the Mann-Whitney distribution.

If there are k ordered categories C1 < C2 < ... < Ck, then the
Null Hypothesis is that Prob(X in Cj) is the same for each of
the two groups of data. It is then possible to devise a "permutation
test", whose evaluation for the data in hand could again be achieved
by random simulation. But you're also getting into contingency table
territory here, which is a somewhat different kind of universe!
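
One way such a permutation test might be sketched, purely as an
illustration: the category scores and the helper mw_stat() are
hypothetical. The Mann-Whitney statistic is computed with midranks for
ties, and its reference distribution is obtained by randomly
reassigning the pooled observations to the two groups, instead of by
referring to the Mann-Whitney distribution.

```r
set.seed(3)
# Hypothetical ordered-category scores (1 < 2 < ... < 5) for two groups.
a <- c(1, 2, 2, 3, 3, 3, 4, 5)
b <- c(2, 3, 4, 4, 4, 5, 5, 5)

# Mann-Whitney statistic with midranks: rank sum of the first n1 values
# of the pooled vector, minus its minimum possible value.
mw_stat <- function(values, n1) {
  sum(rank(values)[seq_len(n1)]) - n1 * (n1 + 1) / 2
}

pooled <- c(a, b)
n1     <- length(a)
w_obs  <- mw_stat(pooled, n1)
w_null <- n1 * length(b) / 2   # expectation of the statistic under the Null

# Random relabelling: shuffle the pooled values, keep the group sizes.
w_perm <- replicate(5000, mw_stat(sample(pooled), n1))

# Two-sided Monte Carlo P-value based on the deviation from the Null
# expectation; the observed arrangement is counted as one permutation.
p_perm <- (1 + sum(abs(w_perm - w_null) >= abs(w_obs - w_null))) /
          (1 + length(w_perm))
p_perm
```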

Hoping this adds something useful!
Ted.

On 23-Dec-09 17:32:50, Greg Snow wrote:
> Adding random noise to data in order to avoid a warning is like
> removing the batteries from a smoke detector to silence it rather than
> investigating what is causing the alarm to go off.
> 
> If the function is giving a warning it is best to investigate why, it
> is possible that you can ignore the warning (the burnt toast of smoke
> alarm analogies) but it is best to convince yourself that it is ok.  It
> is also possible in this case that another tool may be more
> appropriate, and investigating the warning could help you find that
> tool.
> 
> -- 
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
> 
> 
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> project.org] On Behalf Of Bernardo Rangel Tura
>> Sent: Tuesday, December 22, 2009 1:16 AM
>> To: ivorytower at emails.bjut.edu.cn
>> Cc: R-help at r-project.org
>> Subject: Re: [R] Problem with "Cannot compute correct p-values with
>> ties"
>> 
>> On Wed, 2009-12-02 at 16:52 +0800, Zhijiang Wang wrote:
>> > Dear All,
>> >    1. why did the problem happen?
>> >    2. How to solve it?
>> >
>> >    --
>> >
>> > Best wishes,
>> > Zhijiang Wang
>> 
>> 
>> Well... The algorithm for the Mann-Whitney test has a problem with ties.
>> 
>> To solve it you can use jitter:
>> 
>> a<-1:10
>> b<-1:10
>> wilcox.test(a,b)
>> 
>>      Wilcoxon rank sum test with continuity correction
>> 
>> data:  a and b
>> W = 50, p-value = 1
>> alternative hypothesis: true location shift is not equal to 0
>> 
>> Warning message:
>> In wilcox.test.default(a, b) : cannot compute exact p-value with ties
>> 
>> wilcox.test(a,jitter(b))
>> 
>>      Wilcoxon rank sum test
>> 
>> data:  a and jitter(b)
>> W = 49, p-value = 0.9705
>> alternative hypothesis: true location shift is not equal to 0
>> 
>> See ?jitter for more information
>> 
>> --
>> Bernardo Rangel Tura, M.D,MPH,Ph.D
>> National Institute of Cardiology
>> Brazil

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 23-Dec-09                                       Time: 18:14:03
------------------------------ XFMail ------------------------------



