[R] Data Frame Indexing
Jesse Brown
jesse.r.brown at lmco.com
Mon Aug 22 14:13:32 CEST 2011
Hello,
I've been dealing with a set of values that contain time stamps, and part
of my summary needs to look at just weekend data. In trying to limit the
data I've found a large difference in performance depending on how I index
a data frame. I've constructed a minimal example here to try to explain my
observation.

is.weekend <- function(x) {
    tm <- as.POSIXlt(x, origin="1970/01/01")
    format(tm, "%a") %in% c("Sat","Sun")
}

use.lapply <- function(data) {
    data[do.call(rbind, lapply(data$TIME, FUN=is.weekend)),]
}

use.sapply <- function(data) {
    data[sapply(data$TIME, FUN=is.weekend),]
}

use.vapply <- function(data) {
    data[vapply(data$TIME, FUN=is.weekend, FALSE),]
}

use.indexing <- function(data) {
    data[is.weekend(data$TIME),]
}

And the results of these methods:
> names(csv.data)
[1] "TIME" "FILE" "RADIAN" "BITS" "DURATION"
> length(csv.data$TIME)
[1] 21471
> system.time(v1 <- use.lapply(csv.data))
   user  system elapsed
 19.562   6.402  25.967
> system.time(v2 <- use.sapply(csv.data))
   user  system elapsed
 19.456   6.492  25.951
> system.time(v3 <- use.vapply(csv.data))
   user  system elapsed
 19.334   6.468  25.808
> system.time(v4 <- use.indexing(csv.data))
   user  system elapsed
  0.032   0.020   0.052
> all(identical(v1,v2),identical(v2,v3),identical(v3,v4))
[1] TRUE
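
A rough way to reproduce the comparison without the original CSV is sketched below. The fake.data frame, its row count n, the hourly timestamps, and the BITS values are all made up for illustration; they are not the real csv.data, and the timings will differ by machine. Note also that "%a" follows the current locale, so the "Sat"/"Sun" test assumes English day names.

```r
# is.weekend as defined above, repeated so this sketch runs on its own
is.weekend <- function(x) {
    tm <- as.POSIXlt(x, origin = "1970/01/01")
    format(tm, "%a") %in% c("Sat", "Sun")
}

# Hypothetical stand-in for csv.data: hourly epoch seconds starting at
# 2011-08-22 00:00:00 UTC; only the TIME column matters for the comparison
n <- 2000
fake.data <- data.frame(
    TIME = 1313971200 + 3600 * (seq_len(n) - 1),
    BITS = seq_len(n) %% 256
)

# One call per element (sapply) versus one call on the whole column
t.sapply   <- system.time(w1 <- fake.data[sapply(fake.data$TIME, FUN = is.weekend), ])
t.indexing <- system.time(w2 <- fake.data[is.weekend(fake.data$TIME), ])

cat("sapply elapsed:  ", t.sapply[["elapsed"]], "\n")
cat("indexing elapsed:", t.indexing[["elapsed"]], "\n")

# Both approaches should select exactly the same rows
stopifnot(identical(w1, w2))
```

With only 2000 rows the gap is smaller than in the timings above, but it should still be visible.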

Forgive what is probably a trivial question, but why is there such a
large difference between the *apply functions and the direct indexing
method? On the surface it seems as though use.indexing passes the entire
vector to the function in a single call, while the others /might/ iterate
over the values, passing them one at a time. In either case all elements
must be part of the calculation...
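
One way to check that guess is to count how often the function is actually called. The sketch below is only an illustration: the counted wrapper, the calls counter, and the 14 made-up daily timestamps in times are hypothetical names introduced here. It shows how many times the function runs in each case, not where the time goes; presumably the fixed cost of each as.POSIXlt and format call is what adds up over thousands of calls.

```r
# is.weekend as defined above, repeated so this sketch runs on its own
is.weekend <- function(x) {
    tm <- as.POSIXlt(x, origin = "1970/01/01")
    format(tm, "%a") %in% c("Sat", "Sun")
}

# Hypothetical wrapper: records the number of calls and the length of
# the argument seen on each call, then defers to is.weekend
calls <- 0L
sizes <- integer(0)
counted <- function(x) {
    calls <<- calls + 1L
    sizes <<- c(sizes, length(x))
    is.weekend(x)
}

# 14 made-up daily timestamps (epoch seconds)
times <- 1313971200 + 86400 * (0:13)

# sapply hands the function one element at a time
calls <- 0L; sizes <- integer(0)
r1 <- sapply(times, FUN = counted)
sapply.calls <- calls
sapply.sizes <- sizes

# direct use hands the function the whole vector once
calls <- 0L; sizes <- integer(0)
r2 <- counted(times)
direct.calls <- calls
direct.sizes <- sizes

cat("sapply: ", sapply.calls, "calls\n")
cat("direct: ", direct.calls, "call with", direct.sizes, "values\n")
```

So the same elements are processed either way, but the conversion and formatting setup is repeated once per element in the *apply versions.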
Thanks for any insight.
Jesse