[R] Done: Fast way of finding top-n values of a long vector

Fri Jun 5 10:09:56 CEST 2009

I'm all done now.  The "max2" version below is what I went with in the 
end for my proposed change to caret::nearZeroVar (which used the "sort" 
method). Max Kuhn will make it available on CRAN soon.  It speeds up 
that routine by a factor 2-5 on my test cases and uses much less 
memory.  For what it is worth, I also made a C version ("cmax" below) 
which of course is faster yet again and scales nicely for returning the 
top n values of the array:

cmax <- function (v) {max <- vector("double",2); max <- .C("test", 
as.double(v), as.integer(length(v)), max, NAOK=TRUE)[[3]]; 
return(max[1]/max[2]);}

library("rbenchmark")
set.seed(1); x <- runif(1e7, max=1e8); x[1] <- NA;
benchmark(
 replications=20,
 columns=c("test","elapsed"),
 order="elapsed"
 , sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];}
 , qsrt = {a<-sort(x, decreasing=TRUE, na.last=NA, method="quick")[1:2]; 
a[1]/a[2];}
 , part = {a<-sort.int(-x, partial=1:2, na.last=NA)[1:2]; a[1]/a[2];}
 , max1 = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; 
m/max(x[-w],na.rm=TRUE);}
 , max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
 , cmax = {cmax(x);}
)
#   test elapsed
# 6 cmax   4.394
# 5 max2   8.954
# 4 max1  18.835
# 3 part  21.749
# 2 qsrt  46.692
# 1 sort  77.679

Thanks for all the suggestions and comments.

Allan.

PS: Slightly off-topic but is there a way within the syntax of R to set 
up things so that 'sort' (or any function) would know it is called in a 
partial list context in sort(x)[1:2] and it therefore could choose to 
use the "partial" argument automatically for small [] lists?  The R 
interpreter of course knows full well that it is going to drop all but 
the first two values of the result before it calls 'sort'.  Perl has 
'use Want' where howmany() and want(n) provides a subset of this 
functionality (essentially for [] lists of the form 1:n).