[Rd] order(..., na.last = NA) performance hit
Murat Tasan
mmuurr at gmail.com
Mon Jan 19 21:20:31 CET 2015
I've just recently noticed that using the na.last = NA setting with
order incurs a HUGE performance hit.
It appears that much of order(...) (the R wrapper, not the internal
calls) is written in as general a manner as possible to handle the
large number of input types.
But the canonical case of ordering a single vector of numerics suffers
greatly with the current implementation.
Below is a single trivial example (roughly 6X slower by elapsed time),
but overall I've been seeing something closer to a 10X performance hit
when using na.last = NA.
Would it be worth (i) attempting a re-write of the wrapping order(...)
function, or (ii) at least mentioning the performance implications in
the help page for order(...)?
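
For option (i), here is a minimal sketch of what a fast path might look like. It assumes the common case of a single plain atomic vector; the name order_drop_na is hypothetical and not part of base R. It relies on the fact that na.last = TRUE places all missing values at the end, so the first sum(!is.na(x)) entries of the result are the ordering of the non-missing values. Anything else falls back to the existing order(..., na.last = NA).

```r
## Hypothetical helper (not part of base R): fast path for one plain
## atomic vector; every other input falls back to order(..., na.last = NA).
order_drop_na <- function(..., decreasing = FALSE) {
  args <- list(...)
  if (length(args) == 1L && is.atomic(args[[1L]]) && !is.object(args[[1L]])) {
    x <- args[[1L]]
    iix <- order(x, na.last = TRUE, decreasing = decreasing)
    ## NAs are sorted to the end, so keep only the leading non-NA positions
    iix[seq_len(sum(!is.na(x)))]
  } else {
    order(..., na.last = NA, decreasing = decreasing)
  }
}
```

This avoids the second indexing pass x[iix] by counting the non-missing values instead; whether that matters in practice would need measuring.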
Here's an example of the performance hit:
x <- runif(1e6)
x[runif(1e6) > 0.9] <- NA ## add some (~10%) NA values
order2 <- function(x) {
  iix <- order(x, na.last = TRUE)
  iix[!is.na(x[iix])]
}
system.time(y1 <- order(x, na.last = TRUE))
## user system elapsed
## 0.48 0.00 0.48
system.time(y2 <- order(x, na.last = NA))
## user system elapsed
## 3.060 0.056 3.118
system.time(y3 <- order2(x))
## user system elapsed
## 0.520 0.004 0.520
all(y2 == y3)
## [1] TRUE
identical(y2, y3)
## [1] TRUE
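
Since a single system.time() call can be noisy, the comparison above could be repeated a few times and summarized. The following is only a sketch using base R; median_elapsed is a hypothetical helper name, and the vector size and repeat count are arbitrary choices.

```r
## Hypothetical helper: median elapsed seconds of `expr` over `times` runs.
median_elapsed <- function(expr, times = 5L) {
  expr <- substitute(expr)   # capture the call unevaluated
  env <- parent.frame()      # evaluate it where the caller's variables live
  elapsed <- vapply(seq_len(times), function(i) {
    system.time(eval(expr, env))[["elapsed"]]
  }, numeric(1))
  median(elapsed)
}

set.seed(42)
x <- runif(1e5)
x[runif(1e5) > 0.9] <- NA    # ~10% NA values, as in the example above

median_elapsed(order(x, na.last = TRUE))
median_elapsed(order(x, na.last = NA))
```

The absolute numbers will depend on the machine and the R version, so only the ratio between the two calls is of interest.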
Cheers,
-murat