[R] Largest N Values Efficiently?

Mon Nov 12 07:56:12 CET 2007

What is 'x' here?  What type?  Does it contain NAs?  Are there ties?  R's 
ordering functions are rather general, and you can gain efficiency by 
ruling some of these out.

See ?sort, look at the 'partial' argument, including the comments in the 
Details.  And also look at ?sort.list.

sort.int(x) is more efficient than x[order(x)], and x[order(x)[1:n]] is 
more efficient than x[order(x)][1:n] for most types.

Finally, does efficiency matter?  As the examples in ?sort show, R can 
sort a vector of length 2000 is well under 1ms, and 1e7 random normals in 
less time than they take to generate.  There are not many tasks where 
gaining efficiency over x[order(x)][1:n] will be important.  E.g.

> system.time(x <- rnorm(1e6))
    user  system elapsed
    0.44    0.00    0.44
> system.time(x[order(x)][1:4])
    user  system elapsed
    1.72    0.00    1.72
> system.time(x2 <- sort.int(x, method = "quick")[1:4])
    user  system elapsed
    0.31    0.00    0.32
> system.time(min(x))
    user  system elapsed
    0.02    0.00    0.02
> system.time(x2 <- sort.int(x, partial=1)[1])
    user  system elapsed
    0.07    0.00    0.07

and do savings of tenths of a second matter?  (There is also 
quantreg::kselect, if you work out how to use it, which apparently is 
a bit faster at partial sorting on MacOS X but not elsewhere.)

On Sun, 11 Nov 2007, David Katz wrote:

>
> What is the most efficient alternative to x[order(x)][1:n] where
> length(x)>>n?

That is the smallest n values, pace your subject line.

> I also need the positions of the mins/maxs perhaps by preserving names.
>
> Thanks for any suggestions.
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595