[R] Faster Subsetting

Bert Gunter bgunter.4567 at gmail.com
Wed Sep 28 19:21:31 CEST 2016


Note that for base R, by() is considerably faster than aggregate()
(both of which are *much* faster than the sapply() approaches;
tapply() is what is really more appropriate here -- a small sketch is
included below).

(for Constantin's example):

> system.time({
+   res4 <- aggregate(tmp$foo, by = list(id=tmp$id), FUN = mean)
+ })
   user  system elapsed
  0.289   0.006   0.295

> system.time(res6 <-by(tmp,tmp$id,function(x)mean(x$foo)))
   user  system elapsed
  0.124   0.005   0.130


Note, however, that the by() result may need to be reshaped into the
format you want, which costs additional time.
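
For completeness, the tapply() sketch mentioned above (it assumes
Constantin's tmp data frame; res7 is just an illustrative name).
tapply() returns a named vector of per-id means, which is easy to turn
into a data frame if needed:

res7 <- tapply(tmp$foo, tmp$id, mean)   # named vector, names are the ids
res7_df <- data.frame(id = names(res7), mean = as.numeric(res7))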


Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)


On Wed, Sep 28, 2016 at 10:07 AM, Doran, Harold <HDoran at air.org> wrote:
> Many thanks. I also tried the filter function in dplyr, and it was also much slower than simply indexing the way the original code did.
>
> system.time(replicate(500, filter(tmp, id == idList[1])))
>
> I did this on the toy example as well as the real data, finding the same (slower) result each time compared to the indexing method.
>
> Perhaps I'm using it incorrectly?
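>
> One way to compare the three approaches more directly is the
> microbenchmark package (a sketch, assuming tmp and idList from the toy
> example below and dplyr attached; it times each expression 500 times):
>
> library(microbenchmark)
> microbenchmark(
>   indexing = tmp[which(tmp$id == idList[1]), ],
>   subset   = subset(tmp, id == idList[1]),
>   filter   = filter(tmp, id == idList[1]),
>   times    = 500
> )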
>
>
>
> -----Original Message-----
> From: Constantin Weiser [mailto:constantin.weiser at hhu.de]
> Sent: Wednesday, September 28, 2016 12:55 PM
> To: r-help at r-project.org
> Cc: Doran, Harold <HDoran at air.org>
> Subject: Re: [R] Faster Subsetting
>
> I just modified the reproducible example a bit to make it more realistic. The function "mean" could "easily" be replaced by your analysis.
>
> And here are some possible solutions:
>
> tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
> tmp <- tmp[sample(dim(tmp)[1]), ]   # shuffle the rows of the data set
>
> ## with specialized packages
> require(plyr)
> system.time({
>    res1 <- ddply(tmp, .(id), summarize, mean=mean(foo))
> })
>
> require(dplyr)
> system.time({
>    res2 <- tmp %>%
>      group_by(id) %>%
>      summarise(mean = mean(foo))
> })
>
> library(data.table)
> system.time({
>    res3 <- data.table(tmp)[, list(mean=mean(foo)), by=id]
> })
>
>
> ## built-in R methods
> system.time({
>    res4 <- aggregate(tmp$foo, by = list(id=tmp$id), FUN = mean)
> })
>
> system.time({
>    res5 <- sapply(unique(tmp$id), simplify = TRUE,
>                   FUN = function(x){
>                     c(id=x, mean=mean(tmp[which(tmp$id == x), "foo"]))
>                   })
> })
> res5 <- t(res5)
>
> system.time({
>    res5 <- sapply(unique(tmp$id), simplify = TRUE,
>                   FUN = function(x){
>                     sub.tmp <- subset(tmp, tmp$id == x)
>                     c(x,mean=mean(sub.tmp[, "foo"]))
>                   })
> })
> res5 <- t(res5)
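>
> Another base-R option (a sketch assuming the same tmp; res_split is
> just an illustrative name) is to split foo by id once and then take
> the mean of each piece, so the id column is scanned only a single time:
>
> system.time({
>    res_split <- vapply(split(tmp$foo, tmp$id), mean, numeric(1))
> })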
>
>
> Yours
> Constantin
>
>
> --
> ^
> |                X
> |               /eiser, Dr. Constantin (weiserc at hhu.de)
> |              /Chair of Statistics and Econometrics
> |             / Heinrich Heine-University of Düsseldorf
> | *    /\    /  Universitätsstraße 1, 40225 Düsseldorf, Germany
> |  \  /  \  /   Oeconomicum (Building 24.31), Room 01.22
> |   \/    \/    Tel: 0049 211 81-15307
> +----------------------------------------------------------->
>
> On 28.09.2016 at 18:28, Doran, Harold wrote:
>> Thank you very much. I don’t know tidyverse; I’ll look at that now. I
>> did some tests with the data.table package, but it was much slower on
>> my machine; see the examples below.
>>
>> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>>
>> idList <- unique(tmp$id)
>>
>> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>>
>> system.time(replicate(500, subset(tmp, id == idList[1])))
>>
>>
>> library(data.table)
>>
>> tmp2 <- as.data.table(tmp)     # data.table
>>
>> system.time(replicate(500, tmp2[which(tmp$id == idList[1]),]))
>>
>> system.time(replicate(500, subset(tmp2, id == idList[1])))
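>>
>> data.table is usually at its best when the table is keyed and the
>> subset goes through the key (a sketch assuming the tmp2 data.table
>> above): setkey() sorts the table by id, so the lookup below uses a
>> binary search instead of scanning every row.
>>
>> setkey(tmp2, id)
>> system.time(replicate(500, tmp2[.(idList[1])]))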
>>
>> From: Dominik Schneider [mailto:dosc3612 at colorado.edu]
>> Sent: Wednesday, September 28, 2016 12:27 PM
>> To: Doran, Harold <HDoran at air.org>
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Faster Subsetting
>>
>> I regularly crunch through this amount of data with tidyverse. You can also try the data.table package. They are optimized for speed, as long as you have the memory.
>> Dominik
>>
>> On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <HDoran at air.org> wrote:
>> I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp below in the reproducible code. In my real data, the variable 'id' may or may not be ordered, but I think that is irrelevant.
>>
>> I have a process that requires subsetting the data by id and then running each smaller data frame through a set of functions. One example below uses indexing and the other uses an explicit call to subset(); both return the same result, but indexing is faster.
>>
>> The problem is that in my real data, indexing must scan through millions of rows to evaluate the condition, which is expensive and a bottleneck in my code. I'm curious whether anyone can recommend an approach that would be less expensive and faster.
>>
>> Thank you
>> Harold
>>
>>
>> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>>
>> idList <- unique(tmp$id)
>>
>> ### Fast, but not fast enough
>> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>>
>> ### Not fast at all, a big bottleneck
>> system.time(replicate(500, subset(tmp, id == idList[1])))
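>>
>> One possible workaround (a sketch, assuming the same tmp and idList;
>> byId is just an illustrative name) is to split the data frame into a
>> named list of per-id pieces once up front, so that each later lookup
>> is by name and never re-scans the id column:
>>
>> byId <- split(tmp, tmp$id)
>> system.time(replicate(500, byId[[as.character(idList[1])]]))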
>>