[Rd] order(..., na.last = NA) performance hit

Murat Tasan mmuurr at gmail.com
Mon Jan 19 21:20:31 CET 2015


I've just recently noticed that using the na.last = NA setting with
order incurs a HUGE performance hit.
It appears that much of order(...) (the R wrapper, not the internal
calls) is written in as general a manner as possible to handle the
large number of input types.
But the canonical case of ordering a single vector of numerics suffers
greatly with the current implementation.
Below is a single trivial example (about a 6.5X slowdown in elapsed
time), but overall I've been seeing slowdowns on the order of 10X when
using na.last = NA.
Would it be worth (i) attempting a re-write of the wrapping order(...)
function, or (ii) at least mentioning the performance implications in
the help page for order(...)?

Here's an example of the performance hit:

x <- runif(1e6)
x[runif(1e6) > 0.9] <- NA ## add some (~10%) NA values
## workaround: order with NAs placed last, then drop the NA positions
order2 <- function(x) {
    iix <- order(x, na.last = TRUE)
    iix[!is.na(x[iix])]
}

system.time(y1 <- order(x, na.last = TRUE))
##    user  system elapsed
##    0.48    0.00    0.48

system.time(y2 <- order(x, na.last = NA))
##    user  system elapsed
##   3.060   0.056   3.118

system.time(y3 <- order2(x))
##    user  system elapsed
##   0.520   0.004   0.520

all(y2 == y3)
## [1] TRUE
identical(y2, y3)
## [1] TRUE
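
For the single-vector case, another possible fast path is to drop the
NA values before sorting rather than after. The sketch below is only an
illustration: order_drop_na is a hypothetical helper name, not anything
in base R, and it has not been timed here. It relies on order() leaving
ties in their original order, so the result should match
order(x, na.last = NA) for a plain numeric vector.

```r
## Hypothetical helper (not part of base R): order only the non-NA values.
order_drop_na <- function(x) {
    keep <- seq_along(x)[!is.na(x)]  # positions of the non-NA values
    keep[order(x[keep])]             # order those values, map back to x
}

## Small check with ties and NAs against the built-in behaviour.
x_small <- c(2, NA, 1, 2, 1)
stopifnot(identical(order_drop_na(x_small),
                    order(x_small, na.last = NA)))
```

Unlike order2 above, this sorts only the non-NA values, which may help
when a large fraction of the vector is NA.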


Cheers,

-murat


