[R] Rank-based p-value on large dataset
Sean Davis
sdavis2 at mail.nih.gov
Fri Mar 4 01:15:41 CET 2005
On 3/3/05 19:04, "Adaikalavan Ramasamy" <ramasamy at cancer.org.uk> wrote:
> One solution is to cut() 'x' according to the breaks defined by 'y'.
> Using cut with labels=FALSE is really fast. See a simulation below.
>
> However, the accuracy depends on the number of ties in your
> "empirical" distribution. I have simulated some ties with the round()
> function below.
>
> # simulate data
> y <- round(rnorm(80000), 5) # empirical distribution with some ties
> x <- round(rnorm(130000), 5) # observed values with some ties
>
> # current way
> system.time({
> p1 <- numeric( length(x) )
> for(i in 1:length(x)){ p1[i] <- sum( x[i] < y )/length(y) }
> })
> [1] 484.67 25.08 512.04 0.00 0.00
>
> # suggested solution
> system.time({
> break.points <- c(-Inf, unique(sort(y)), Inf)
> p2 <- cut( x, breaks=break.points, labels=FALSE )
> p2 <- 1 - p2/length(break.points)
> })
> [1] 0.27 0.01 0.28 0.00 0.00
>
>
> head( cbind( p1, p2 ) )
> p1 p2
> [1,] 0.658375 0.65813482
> [2,] 0.144000 0.14533705
> [3,] 0.815500 0.81436898
> [4,] 0.412500 0.41269640
> [5,] 0.553625 0.55334516
> [6,] 0.044500 0.04510897
> ...
>
> cor(p1, p2)
> [1] 0.9999987
>
>
> The difference arises mainly because I had to use unique breakpoints
> in cut(). You may want to investigate further if you are interested.
> Please let me know if you find anything good or bad about this
> suggestion, as I am using it as part of my code too. Thank you.
>
> Regards, Adai
>
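[Editor's note: for anyone following the thread, the same rank-based p-value (the fraction of the empirical sample strictly greater than each observed value) can also be computed exactly, ties included, by sorting the empirical sample once and binary-searching each observation. A minimal sketch in Python rather than R, purely to illustrate the idea; the variable names mirror the post, and none of this code is from the original thread:]

```python
import bisect
import random

random.seed(0)

# Mirror the simulation in the post (smaller sizes for illustration):
# 'y' is the empirical distribution, 'x' the observed values, with ties
# induced by rounding.
y = sorted(round(random.gauss(0, 1), 5) for _ in range(8000))
x = [round(random.gauss(0, 1), 5) for _ in range(1300)]

n = len(y)

# p_i = sum(x_i < y) / n, exactly as the slow loop computes it.
# bisect_right returns the number of y-values <= x_i, so n minus that
# count is the number strictly greater than x_i; ties are handled exactly.
p = [(n - bisect.bisect_right(y, xi)) / n for xi in x]

# Sanity check against the direct O(n*m) definition on a few values.
for xi, pi in zip(x[:5], p[:5]):
    assert pi == sum(xi < yi for yi in y) / n
```

Sorting once and binary-searching costs O((n + m) log n), so it is in the same speed class as the cut() trick while reproducing the loop's answer exactly rather than approximately.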
Adai,
All I have to say is, "WOW!". I'll give it a try tomorrow and let you know.
Thanks,
Sean