[Rd] Subsetting a data frame vs. subsetting the columns
Joshua Wiley
jwiley.psych at gmail.com
Wed Dec 28 18:24:23 CET 2011
On Wed, Dec 28, 2011 at 8:14 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
> Hadley,
>
> there was a whole discussion about subsetting and subassigning data frames (and general efficiency issues) some time ago (I can't find it in a hurry but others might)
Yep, a rather lengthy discussion at that:
http://r.789695.n4.nabble.com/speeding-up-perception-td3640920.html
IIRC, there was also some off-list discussion about what it would take
to push this to C, which I may still have in my inbox if anyone wants it.
Cheers,
Josh
> -- just look at the `[.data.frame` code to see why it's so slow. It
> would need to be pushed into C code to allow certain optimizations,
> but it's quite complex code so I don't think there were volunteers.
> So the advice is don't do it ;). Treating DFs as lists is always
> faster since you get to the fast internal code.
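
For anyone skimming the thread, here is a rough sketch of what "treating
DFs as lists" can look like for the row-reordering case. It is only an
illustration under some assumptions: `reorder_rows` is a made-up helper
name (not part of base R), every column is a plain atomic vector, and
`idx` is a vector of positive integer positions with no NAs. It skips all
the checks that `[.data.frame` performs, so it is not a drop-in replacement.

```r
# Sketch: reorder rows by subsetting each column as a list element,
# then restore the data frame attributes without any validity checks.
# `reorder_rows` is a hypothetical name used only for this example.
reorder_rows <- function(df, idx) {
  # lapply() on a data frame iterates over its columns and keeps their
  # names; x[idx] on an atomic vector uses the fast internal subsetting.
  cols <- lapply(df, function(x) x[idx])
  # Compact automatic row names, c(NA, -n), as in list_to_df() below.
  structure(cols,
            class = "data.frame",
            row.names = c(NA_integer_, -length(idx)))
}

df  <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])

fast <- reorder_rows(df, ord)
slow <- df[ord, ]

# The column values agree. The row names do not: df[ord, ] carries the
# original row labels along, while the sketch renumbers the rows 1..n.
stopifnot(identical(as.list(fast), as.list(slow)))
```

Whether the difference in row names matters depends on what the result is
used for, so treat this as a starting point rather than a recommendation.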
>
> Cheers,
> S
>
>
> On Dec 28, 2011, at 10:37 AM, Hadley Wickham wrote:
>
>> Hi all,
>>
>> There seems to be rather a large speed disparity in subsetting when
>> working with a whole data frame vs. working with just columns
>> individually:
>>
>> df <- as.data.frame(replicate(10, runif(1e5)))
>> ord <- order(df[[1]])
>>
>> system.time(df[ord, ])
>> # user system elapsed
>> # 0.043 0.007 0.059
>> system.time(lapply(df, function(x) x[ord]))
>> # user system elapsed
>> # 0.022 0.008 0.029
>>
>> What's going on?
>>
>> I realise this isn't quite a fair example because the second case
>> makes a list not a data frame, but I thought it would be quick
>> operation to turn a list into a data frame if you don't do any
>> checking:
>>
>> list_to_df <- function(list) {
>>   n <- length(list[[1]])
>>   structure(list,
>>             class = "data.frame",
>>             row.names = c(NA, -n))
>> }
>> system.time(list_to_df(lapply(df, function(x) x[ord])))
>> # user system elapsed
>> # 0.031 0.017 0.048
>>
>> So I guess this is slow because it has to make a copy of the whole
>> data frame to modify the structure. But couldn't [.data.frame avoid
>> that?
>>
>> Hadley
>>
>>
>> --
>> Assistant Professor / Dobelman Family Junior Chair
>> Department of Statistics / Rice University
>> http://had.co.nz/
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
>
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/