[Rd] Subsetting a data frame vs. subsetting the columns

Wed Dec 28 17:14:29 CET 2011

Hadley,

there was a whole discussion about subsetting and subassigning data frames (and general efficiency issues) some time ago (I can't find it in a hurry, but others might). Just look at the `[.data.frame` code to see why it's so slow. It would need to be pushed into C code to allow certain optimizations, but it's quite complex code, so I don't think there were any volunteers. So the advice is: don't do it ;). Treating DFs as lists is always faster, since you get straight to the fast internal code.
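
To make the "treat it as a list" advice concrete, here is a minimal sketch. The helper name subset_rows_as_list() is made up for illustration and is not part of base R. It assumes every column is a plain atomic vector, that there is at least one column, and that automatic row names are fine in the result; it does none of the checking `[.data.frame` does.

```r
# Hypothetical helper (not in base R): subset rows column by column,
# then rebuild the data frame attributes by hand without any validation.
subset_rows_as_list <- function(df, i) {
  # lapply() sees the data frame as a plain list of columns, so each
  # column is subset by the fast internal vector code.
  cols <- lapply(df, function(x) x[i])

  # Assumption: all columns have the same length and there is at least one.
  n <- length(cols[[1L]])

  # Compact automatic row names, the same form c(NA, -n) used in the
  # original post; .set_row_names() is the base R helper that builds it.
  attr(cols, "row.names") <- .set_row_names(n)
  class(cols) <- "data.frame"
  cols
}

df  <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])

res <- subset_rows_as_list(df, ord)
```

Note that, unlike df[ord, ], the result gets fresh row names 1..n rather than the permuted originals, so the two are not interchangeable if you rely on row names.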

Cheers,
S

On Dec 28, 2011, at 10:37 AM, Hadley Wickham wrote:

> Hi all,
> 
> There seems to be rather a large speed disparity in subsetting when
> working with a whole data frame vs. working with just columns
> individually:
> 
> df <- as.data.frame(replicate(10, runif(1e5)))
> ord <- order(df[[1]])
> 
> system.time(df[ord, ])
> #   user  system elapsed
> #  0.043   0.007   0.059
> system.time(lapply(df, function(x) x[ord]))
> #   user  system elapsed
> #  0.022   0.008   0.029
> 
> What's going on?
> 
> I realise this isn't quite a fair example because the second case
> makes a list, not a data frame, but I thought it would be a quick
> operation to turn a list into a data frame if you don't do any
> checking:
> 
> list_to_df <- function(list) {
>  n <- length(list[[1]])
>  structure(list,
>    class = "data.frame",
>    row.names = c(NA, -n))
> }
> system.time(list_to_df(lapply(df, function(x) x[ord])))
> #    user  system elapsed
> #  0.031   0.017   0.048
> 
> So I guess this is slow because it has to make a copy of the whole
> data frame to modify the structure.  But couldn't `[.data.frame` avoid
> that?
> 
> Hadley
> 
> 
> -- 
> Assistant Professor / Dobelman Family Junior Chair
> Department of Statistics / Rice University
> http://had.co.nz/
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
>