[Rd] Subsetting a data frame vs. subsetting the columns
Hadley Wickham
hadley at rice.edu
Wed Dec 28 16:37:01 CET 2011
Hi all,
There seems to be rather a large speed disparity in subsetting when
working with a whole data frame vs. working with just columns
individually:
df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])
system.time(df[ord, ])
# user system elapsed
# 0.043 0.007 0.059
system.time(lapply(df, function(x) x[ord]))
# user system elapsed
# 0.022 0.008 0.029
What's going on?
I realise this isn't quite a fair example because the second case
makes a list, not a data frame, but I thought it would be a quick
operation to turn a list into a data frame if you don't do any
checking:
list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}
system.time(list_to_df(lapply(df, function(x) x[ord])))
# user system elapsed
# 0.031 0.017 0.048
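As a sanity check that the two routes really produce the same data, here is a minimal sketch comparing them on a smaller data frame. The names a and b and the set.seed() call are only for illustration. The one expected difference is the row names: df[ord, ] keeps the original (permuted) row names, while the unchecked conversion resets them to 1..n, so they are dropped before comparing.

```r
set.seed(1)
df <- as.data.frame(replicate(10, runif(1e3)))
ord <- order(df[[1]])

# Unchecked list -> data frame conversion, as defined above
list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}

a <- df[ord, ]                                     # whole data frame subset
b <- list_to_df(lapply(df, function(x) x[ord]))    # column-by-column subset

# Reset the permuted row names on `a` so only the data is compared
rownames(a) <- NULL
stopifnot(isTRUE(all.equal(a, b)))
```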
So I guess this is slow because it has to make a copy of the whole
data frame to modify the structure. But couldn't [.data.frame avoid
that?
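One way to probe the copying guess is to time the conversion separately from the subsetting, and to try setting the attributes one at a time instead of going through structure(). The sketch below does that. list_to_df2 and cols are hypothetical names introduced here, and whether this variant actually avoids a copy is an assumption that will depend on the R version, so it needs to be measured rather than taken as given.

```r
# Hypothetical variant of list_to_df(): set attributes one at a time
# rather than through structure(). No checking is done on the columns.
list_to_df2 <- function(cols) {
  n <- length(cols[[1]])
  attr(cols, "row.names") <- .set_row_names(n)  # compact form c(NA, -n)
  class(cols) <- "data.frame"
  cols
}

df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])
cols <- lapply(df, function(x) x[ord])

# Time the two steps separately to see which one dominates
system.time(lapply(df, function(x) x[ord]))
system.time(list_to_df2(cols))

res <- list_to_df2(cols)
stopifnot(is.data.frame(res), nrow(res) == nrow(df))
```

If the conversion step turns out to be negligible on its own, the extra time in the combined call is coming from somewhere other than the attribute change.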
Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/