[R] Rank-based p-value on large dataset
Sean Davis
sdavis2 at mail.nih.gov
Fri Mar 4 01:15:41 CET 2005
On 3/3/05 19:04, "Adaikalavan Ramasamy" <ramasamy at cancer.org.uk> wrote:
> One solution is to cut() 'x' according to the breaks defined by 'y'.
> Using cut with labels=FALSE is really fast. See a simulation below.
>
> However, the accuracy depends on the number of ties in your
> "empirical" distribution. I have simulated some ties with the round()
> function below.
>
> # simulate data
> y <- round(rnorm(80000), 5) # empirical distribution with some ties
> x <- round(rnorm(130000), 5) # observed values with some ties
>
> # current way
> system.time({
> p1 <- numeric( length(x) )
> for(i in 1:length(x)){ p1[i] <- sum( x[i] < y )/length(y) }
> })
> [1] 484.67 25.08 512.04 0.00 0.00
>
> # suggested solution
> system.time({
> break.points <- c(-Inf, unique(sort(y)), Inf)
> p2 <- cut( x, breaks=break.points, labels=FALSE )
> p2 <- 1 - p2/length(break.points)
> })
> [1] 0.27 0.01 0.28 0.00 0.00
>
>
> head( cbind( p1, p2 ) )
> p1 p2
> [1,] 0.658375 0.65813482
> [2,] 0.144000 0.14533705
> [3,] 0.815500 0.81436898
> [4,] 0.412500 0.41269640
> [5,] 0.553625 0.55334516
> [6,] 0.044500 0.04510897
> ...
>
> cor(p1, p2)
> [1] 0.9999987
>
>
> The difference arises mainly because I had to use unique breakpoints
> in cut(). You may want to investigate further if you are interested.
> Please let me know if you find anything good or bad about this
> suggestion, as I am using it as part of my code too. Thank you.
>
> Regards, Adai
>
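[Editor's note: for anyone following the thread, the same rank-based p-value (the fraction of the empirical sample strictly greater than each observed value) can also be computed exactly, ties included, by sorting the empirical sample once and binary-searching each observation. A minimal sketch in Python rather than R, purely to illustrate the idea; the variable names mirror the post, and none of this code is from the original thread:]

```python
import bisect
import random

random.seed(0)

# Mirror the simulation in the post (smaller sizes for illustration):
# 'y' is the empirical distribution, 'x' the observed values, with ties
# induced by rounding.
y = sorted(round(random.gauss(0, 1), 5) for _ in range(8000))
x = [round(random.gauss(0, 1), 5) for _ in range(1300)]

n = len(y)

# p_i = sum(x_i < y) / n, exactly as the slow loop computes it.
# bisect_right returns the number of y-values <= x_i, so n minus that
# count is the number strictly greater than x_i; ties are handled exactly.
p = [(n - bisect.bisect_right(y, xi)) / n for xi in x]

# Sanity check against the direct O(n*m) definition on a few values.
for xi, pi in zip(x[:5], p[:5]):
    assert pi == sum(xi < yi for yi in y) / n
```

Sorting once and binary-searching costs O((n + m) log n), so it is in the same speed class as the cut() trick while reproducing the loop's answer exactly rather than approximately.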
Adai,
All I have to say is, "WOW!". I'll give it a try tomorrow and let you know.
Thanks,
Sean