[R] Done: Fast way of finding top-n values of a long vector
Allan Engelhardt
allane at cybaea.com
Fri Jun 5 10:09:56 CEST 2009
I'm all done now. The "max2" version below is what I went with in the
end for my proposed change to caret::nearZeroVar (which used the "sort"
method). Max Kuhn will make it available on CRAN soon. It speeds up
that routine by a factor 2-5 on my test cases and uses much less
memory. For what it is worth, I also made a C version ("cmax" below)
which of course is faster yet again and scales nicely for returning the
top n values of the array:
cmax <- function (v) {max <- vector("double",2); max <- .C("test",
as.double(v), as.integer(length(v)), max, NAOK=TRUE)[[3]];
return(max[1]/max[2]);}
library("rbenchmark")
set.seed(1); x <- runif(1e7, max=1e8); x[1] <- NA;
benchmark(
replications=20,
columns=c("test","elapsed"),
order="elapsed"
, sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];}
, qsrt = {a<-sort(x, decreasing=TRUE, na.last=NA, method="quick")[1:2];
a[1]/a[2];}
, part = {a<-sort.int(-x, partial=1:2, na.last=NA)[1:2]; a[1]/a[2];}
, max1 = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1];
m/max(x[-w],na.rm=TRUE);}
, max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
, cmax = {cmax(x);}
)
# test elapsed
# 6 cmax 4.394
# 5 max2 8.954
# 4 max1 18.835
# 3 part 21.749
# 2 qsrt 46.692
# 1 sort 77.679
Thanks for all the suggestions and comments.
Allan.
PS: Slightly off-topic but is there a way within the syntax of R to set
up things so that 'sort' (or any function) would know it is called in a
partial list context in sort(x)[1:2] and it therefore could choose to
use the "partial" argument automatically for small [] lists? The R
interpreter of course knows full well that it is going to drop all but
the first two values of the result before it calls 'sort'. Perl has
'use Want' where howmany() and want(n) provides a subset of this
functionality (essentially for [] lists of the form 1:n).
More information about the R-help
mailing list