[Rd] Subsetting a data frame vs. subsetting the columns

Wed Dec 28 16:37:01 CET 2011

Hi all,

Subsetting the rows of a whole data frame seems to be about twice as
slow as subsetting each of its columns individually:

df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])

system.time(df[ord, ])
#   user  system elapsed
#  0.043   0.007   0.059
system.time(lapply(df, function(x) x[ord]))
#   user  system elapsed
#  0.022   0.008   0.029
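
A single system.time() call is fairly noisy at this scale, so the comparison is probably safer if each expression is repeated a few times. Here is a minimal sketch of that. The helper name time_median and the choice of 20 repetitions are my own assumptions, not anything from base R, and the numbers will vary by machine:

```r
# Same setup as above, with a seed so the data are reproducible.
set.seed(1)
df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])

# Hypothetical helper: call f() `reps` times and return the median
# elapsed time in seconds.
time_median <- function(f, reps = 20) {
  elapsed <- vapply(
    seq_len(reps),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

t_whole <- time_median(function() df[ord, ])
t_cols  <- time_median(function() lapply(df, function(x) x[ord]))
c(whole = t_whole, columns = t_cols)
```

The median is used rather than the mean so that an occasional garbage collection pause does not dominate the result.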

What's going on?

I realise this isn't quite a fair example because the second case
makes a list, not a data frame, but I thought it would be a quick
operation to turn a list into a data frame if you don't do any
checking:

list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}
system.time(list_to_df(lapply(df, function(x) x[ord])))
#   user  system elapsed
#  0.031   0.017   0.048
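
One caveat on comparing the two approaches: as far as I can tell they do not return exactly the same object. The sketch below (the names a, b, same_columns and same_rownames are just for illustration) checks that the column contents agree while the row names differ, since df[ord, ] carries the original row names along and list_to_df() resets them to 1..n:

```r
set.seed(1)
df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])

# The unchecked conversion from the message above.
list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}

a <- df[ord, ]
b <- list_to_df(lapply(df, function(x) x[ord]))

# as.list() drops the class and row names, leaving only the columns.
same_columns <- identical(as.list(a), as.list(b))

# a keeps the permuted original row names; b has fresh automatic ones.
same_rownames <- identical(rownames(a), rownames(b))
```

So part of the extra work in [.data.frame may be the row name handling, which the list-based version skips entirely.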

So I guess the conversion is slow because it has to make a copy of the
whole data frame to modify its attributes.  But couldn't [.data.frame
avoid that copy?

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/