[R] Mann-Whitney U

Wed Aug 15 22:13:11 CEST 2007

On Wed, 2007-08-15 at 12:06 -0600, Natalie O'Toole wrote:
> Hi,
> 
> I do want to use the Mann-Whitney test which ranks my data and then
> uses 
> those ranks rather than the actual data.
> 
> Here is the R code i am using:
> 
>  group1<- 
> c(1.34,1.47,1.48,1.49,1.62,1.67,1.7,1.7,1.7,1.73,1.81,1.84,1.9,1.96,2,2,2.19,2.29,2.29,2.41,2.41,2.46,2.5,2.6,2.8,2.8,3.07,3.3)
> > group2<- 
> c(0.98,1.18,1.25,1.33,1.38,1.4,1.49,1.57,1.72,1.75,1.8,1.82,1.86,1.9,1.97,2.04,2.14,2.18,2.49,2.5,2.55,2.57,2.64,2.73,2.77,2.9,2.94,NA)
> > result <-  wilcox.test(group1, group2, paired=FALSE, conf.level =
> 0.95, 
> na.action)

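You passed na.action as a bare, unnamed argument with no value, hence
the error message you are getting. R matches it by position to the next
unfilled argument of wilcox.test(), which is 'alternative', and the
check of that argument fails because na.action is a function rather
than a character string.

na.action is an argument of the formula method of wilcox.test() only.
There it defaults to 'na.omit', unless you have modified R's options.
See ?na.action for more information.

With two vectors, as in this case, wilcox.test() removes any NA values
from each vector prior to calculating the statistic.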

The additional arguments are really superfluous here. You can simply
use:

  wilcox.test(group1, group2)
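
As a hedged illustration of the points above (a sketch to try on your
own system, not part of the original exchange): the two-vector call
drops the NA by itself, a bare na.action produces an error, and
na.action can be given by name in the formula interface. The object
names w_na, w_clean, bad, d and w_formula are my own, and
exact = FALSE is only there to request the normal approximation
directly and so avoid the ties warning.

```r
group1 <- c(1.34,1.47,1.48,1.49,1.62,1.67,1.7,1.7,1.7,1.73,1.81,1.84,
            1.9,1.96,2,2,2.19,2.29,2.29,2.41,2.41,2.46,2.5,2.6,2.8,2.8,
            3.07,3.3)
group2 <- c(0.98,1.18,1.25,1.33,1.38,1.4,1.49,1.57,1.72,1.75,1.8,1.82,
            1.86,1.9,1.97,2.04,2.14,2.18,2.49,2.5,2.55,2.57,2.64,2.73,
            2.77,2.9,2.94,NA)

# The two-vector (default) method removes NA values itself, so the
# result is the same whether or not the NA is dropped beforehand.
w_na    <- wilcox.test(group1, group2, exact = FALSE)
w_clean <- wilcox.test(group1, group2[!is.na(group2)], exact = FALSE)
print(unname(w_na$statistic))
print(all.equal(w_na$p.value, w_clean$p.value))

# A bare 'na.action' is matched by position and the call fails.
bad <- tryCatch(
  wilcox.test(group1, group2, paired = FALSE, conf.level = 0.95, na.action),
  error = function(e) e
)
print(inherits(bad, "error"))

# In the formula interface, na.action is a real argument and can be named.
d <- data.frame(value = c(group1, group2),
                grp   = factor(rep(c("g1", "g2"),
                                   c(length(group1), length(group2)))))
w_formula <- wilcox.test(value ~ grp, data = d, na.action = na.omit,
                         exact = FALSE)
print(unname(w_formula$statistic))
```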

> paired = FALSE so that the Wilcoxon rank sum test which is equivalent
> to 
> the Mann-Whitney test is used (my samples are NOT paired).
> conf.level = 0.95 to specify the confidence level
> na.action is used because i have a NA value (i suspect i am not using 
> na.action in the correct manner)
> 
> When i use this code i get the following error message:
> 
> Error in arg == choices : comparison (1) is possible only for atomic
> and 
> list types
> 
> When i use this code:
> 
>  group1<- 
> c(1.34,1.47,1.48,1.49,1.62,1.67,1.7,1.7,1.7,1.73,1.81,1.84,1.9,1.96,2,2,2.19,2.29,2.29,2.41,2.41,2.46,2.5,2.6,2.8,2.8,3.07,3.3)
> > group2<- 
> c(0.98,1.18,1.25,1.33,1.38,1.4,1.49,1.57,1.72,1.75,1.8,1.82,1.86,1.9,1.97,2.04,2.14,2.18,2.49,2.5,2.55,2.57,2.64,2.73,2.77,2.9,2.94,NA)
> > result <-  wilcox.test(group1, group2, paired=FALSE, conf.level =
> 0.95)
> 
> I get the following result:
> 
>   Wilcoxon rank sum test with continuity correction
> 
> data:  group1 and group2 
> W = 405.5, p-value = 0.6494
> alternative hypothesis: true location shift is not equal to 0 
> 
> Warning message:
> cannot compute exact p-value with ties in:
> wilcox.test.default(group1, 
> group2, paired = FALSE, conf.level = 0.95) 
> 
> The W value here is 405.5 with a p-value of 0.6494
> 
> 
> in SPSS, i am ranking my data and then performing a Mann-Whitney U by 
> selecting analyze - non-parametric tests - 2 independent samples  and
> then 
> checking off the Mann-Whitney U test.
> 
> For the Mann-Whitney test in SPSS i am getting the following results:
> 
> Mann-Whitney U = 350.5
>  2- tailed p value = 0.643
> 
> I think maybe the discrepancy has to do with the specification of the
> NA 
> values in R, but i'm not sure.
> 
> 
> If anyone has any suggestions, please let me know!
> 
> I hope i have provided enough information to convey my problem.
> 
> Thank-you, 
> 
> Nat

It would appear that SPSS is reversing the two groups in its
calculation and NOT applying a continuity correction by default.

If you review the internal code for wilcox.test(), by using:

  stats:::wilcox.test.default

you can see that the relevant code in this case is:

  r <- rank(c(x - mu, y))
  n.x <- as.double(length(x))
  n.y <- as.double(length(y))

  STATISTIC <- sum(r[seq_along(x)]) - n.x * (n.x + 1)/2

Thus, if we use 'x' and 'y' for your two groups, respectively, we get:

x <- c(1.34,1.47,1.48,1.49,1.62,1.67,1.7,1.7,1.7,1.73,1.81,1.84,
       1.9,1.96, 2,2,2.19,2.29,2.29,2.41,2.41,2.46,2.5,2.6,2.8,2.8,
       3.07,3.3)

y <- c(0.98,1.18,1.25,1.33,1.38,1.4,1.49,1.57,1.72,1.75,1.8,1.82,
       1.86,1.9,1.97,2.04,2.14,2.18,2.49,2.5,2.55,2.57,2.64,2.73,
       2.77,2.9,2.94,NA)

mu <- 0

# Now remove the NA values
x <- na.omit(x)
y <- na.omit(y)

r <- rank(c(x - mu, y))
n.x <- as.double(length(x))
n.y <- as.double(length(y))

> r
 [1]  5.0  8.0  9.0 10.5 13.0 14.0 16.0 16.0 16.0 19.0 22.0 24.0 26.5
[14] 28.0 30.5 30.5 35.0 36.5 36.5 38.5 38.5 40.0 42.5 46.0 50.5 50.5
[27] 54.0 55.0  1.0  2.0  3.0  4.0  6.0  7.0 10.5 12.0 18.0 20.0 21.0
[40] 23.0 25.0 26.5 29.0 32.0 33.0 34.0 41.0 42.5 44.0 45.0 47.0 48.0
[53] 49.0 52.0 53.0

> n.x
[1] 28

> n.y
[1] 27

STATISTIC <- sum(r[seq_along(x)]) - n.x * (n.x + 1)/2

> STATISTIC
[1] 405.5

This is the value you get with R as you have used it.

Now, to replicate the statistic in SPSS, use the following code, with x
and y interchanged:

r <- rank(c(y - mu, x))
n.x <- as.double(length(x))
n.y <- as.double(length(y))

STATISTIC <- sum(r[seq_along(y)]) - n.y * (n.y + 1)/2

So we get:

> r
 [1]  1.0  2.0  3.0  4.0  6.0  7.0 10.5 12.0 18.0 20.0 21.0 23.0 25.0
[14] 26.5 29.0 32.0 33.0 34.0 41.0 42.5 44.0 45.0 47.0 48.0 49.0 52.0
[27] 53.0  5.0  8.0  9.0 10.5 13.0 14.0 16.0 16.0 16.0 19.0 22.0 24.0
[40] 26.5 28.0 30.5 30.5 35.0 36.5 36.5 38.5 38.5 40.0 42.5 46.0 50.5
[53] 50.5 54.0 55.0

> n.x
[1] 28

> n.y
[1] 27

> STATISTIC
[1] 350.5

So we now match SPSS' calculation of the statistic.
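
The two values are linked by a simple identity: the statistic computed
from the ranks of x plus the one computed from the ranks of y equals
n.x * n.y, here 405.5 + 350.5 = 756 = 28 * 27. A minimal sketch to
check this with the same data (the helper name rank_sum_stat is my
own, not an R function):

```r
x <- c(1.34,1.47,1.48,1.49,1.62,1.67,1.7,1.7,1.7,1.73,1.81,1.84,
       1.9,1.96,2,2,2.19,2.29,2.29,2.41,2.41,2.46,2.5,2.6,2.8,2.8,
       3.07,3.3)
y <- c(0.98,1.18,1.25,1.33,1.38,1.4,1.49,1.57,1.72,1.75,1.8,1.82,
       1.86,1.9,1.97,2.04,2.14,2.18,2.49,2.5,2.55,2.57,2.64,2.73,
       2.77,2.9,2.94,NA)

# Rank sum statistic for the first sample, after dropping NA values
# (hypothetical helper mirroring the STATISTIC line shown above).
rank_sum_stat <- function(a, b) {
  a <- a[!is.na(a)]
  b <- b[!is.na(b)]
  r <- rank(c(a, b))
  sum(r[seq_along(a)]) - length(a) * (length(a) + 1) / 2
}

W_xy <- rank_sum_stat(x, y)   # x taken as the first sample
W_yx <- rank_sum_stat(y, x)   # y taken as the first sample

print(c(W_xy, W_yx))
print(W_xy + W_yx == sum(!is.na(x)) * sum(!is.na(y)))
```

So either value determines the other, and which of the two a package
reports is a matter of convention.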

Now, to complete the process and replicate the SPSS results fully, you
could do the following, by reversing the order of your arguments and
setting 'correct = FALSE'. I am using 'x' and 'y' here, but use 'group1'
and 'group2' on your system:

> wilcox.test(y, x, correct = FALSE)

	Wilcoxon rank sum test

data:  y and x 
W = 350.5, p-value = 0.6433
alternative hypothesis: true location shift is not equal to 0 

Warning message:
cannot compute exact p-value with ties in: wilcox.test.default(y, x,
correct = FALSE) 
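
To see where the two p-values come from, here is a hedged sketch of the
normal approximation with a correction for ties, written from my
reading of stats:::wilcox.test.default. The function name approx_p is
hypothetical, and the details should be checked against the source of
your own R version. It should agree with the 0.6494 (with continuity
correction) and 0.6433 (without) reported above.

```r
x <- c(1.34,1.47,1.48,1.49,1.62,1.67,1.7,1.7,1.7,1.73,1.81,1.84,
       1.9,1.96,2,2,2.19,2.29,2.29,2.41,2.41,2.46,2.5,2.6,2.8,2.8,
       3.07,3.3)
y <- c(0.98,1.18,1.25,1.33,1.38,1.4,1.49,1.57,1.72,1.75,1.8,1.82,
       1.86,1.9,1.97,2.04,2.14,2.18,2.49,2.5,2.55,2.57,2.64,2.73,
       2.77,2.9,2.94,NA)

# Two-sided p-value from the normal approximation (hypothetical helper).
approx_p <- function(a, b, correct = TRUE) {
  a <- a[!is.na(a)]
  b <- b[!is.na(b)]
  n.a <- length(a)
  n.b <- length(b)
  r <- rank(c(a, b))
  W <- sum(r[seq_along(a)]) - n.a * (n.a + 1) / 2
  ties <- as.numeric(table(r))          # size of each group of tied ranks
  z <- W - n.a * n.b / 2                # centre W at its null mean
  sigma <- sqrt((n.a * n.b / 12) *
                ((n.a + n.b + 1) -
                 sum(ties^3 - ties) / ((n.a + n.b) * (n.a + n.b - 1))))
  if (correct) z <- z - sign(z) * 0.5   # continuity correction
  2 * pnorm(-abs(z / sigma))
}

p_corrected   <- approx_p(x, y, correct = TRUE)
p_uncorrected <- approx_p(y, x, correct = FALSE)
print(round(c(p_corrected, p_uncorrected), 4))

# Swapping the samples flips the sign of z only, so the two-sided
# p-value is unchanged; only 'correct' makes the difference.
print(all.equal(approx_p(x, y, correct = FALSE), p_uncorrected))
```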

BTW, I located an FTP site with SPSS' algorithm documentation online at:

  ftp://ftp.spss.com/pub/spss/statistics/spss/algorithms/

For the MW test, the relevant document is npart.pdf.

HTH,

Marc Schwartz