[R] Faster Subsetting

Wed Sep 28 19:07:36 CEST 2016

Many thanks. I also tried the filter function in dplyr, and it was also much slower than the plain indexing used in the original code.

system.time(replicate(500, filter(tmp, id == idList[1])))

I did this on the toy example as well as the real data, finding the same (slower) result each time compared to the indexing method.

Perhaps I'm using it incorrectly?
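A possible explanation, offered as an assumption rather than something measured in this thread: on a 2000-row toy frame the fixed per-call overhead of filter() or subset() probably dominates, and every per-id call rescans the whole id column whichever function is used. The sketch below scans the data once with base R split() and then works on the pieces. The function name analyse is hypothetical and stands in for the real set of functions.

```r
# Toy data in the same shape as the thread's example.
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
tmp <- tmp[sample(nrow(tmp)), ]   # shuffle so that id is not ordered

# One pass over the data: a named list with one data frame per id.
pieces <- split(tmp, tmp$id)

# Hypothetical per-id analysis; replace with the real functions.
analyse <- function(d) c(n = nrow(d), mean = mean(d$foo))

# Apply the analysis to every piece; one row per id in the result.
res <- t(vapply(pieces, analyse, numeric(2)))

# Fetching a single id is now a list lookup by name, not a scan.
one <- pieces[["1"]]

# Lighter on memory: split only the row numbers and index on demand.
idx <- split(seq_len(nrow(tmp)), tmp$id)
one_again <- tmp[idx[["1"]], ]
```

Splitting the row numbers instead of the data frame avoids holding a second copy of the data, which may matter at around 13 million rows.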

-----Original Message-----
From: Constantin Weiser [mailto:constantin.weiser at hhu.de] 
Sent: Wednesday, September 28, 2016 12:55 PM
To: r-help at r-project.org
Cc: Doran, Harold <HDoran at air.org>
Subject: Re: [R] Faster Subsetting

I just modified the reproducible example a bit, so it's a bit more realistic. The function "mean" could be "easily" replaced by your analysis.

And here are some possible solutions:

tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
tmp <- tmp[sample(dim(tmp)[1]),] # re-sampling the dataset

## with specialized packages
require(plyr)
system.time({
   res1 <- ddply(tmp, .(id), summarize, mean=mean(foo))
})

require(dplyr)
system.time({
   res2 <- tmp %>%
     group_by(id) %>%
     summarise(mean = mean(foo))
})

library(data.table)
system.time({
   res3 <- data.table(tmp)[, list(mean=mean(foo)), by=id]
})

## built-in R methods
system.time({
   res4 <- aggregate(tmp$foo, by = list(id=tmp$id), FUN = mean)
})

system.time({
   res5 <- sapply(unique(tmp$id), simplify = TRUE,
                  FUN = function(x){
                    c(id=x, mean=mean(tmp[which(tmp$id == x), "foo"]))
                  })
})
res5 <- t(res5)

system.time({
   res6 <- sapply(unique(tmp$id), simplify = TRUE,
                  FUN = function(x){
                    sub.tmp <- subset(tmp, tmp$id == x)
                    c(id=x, mean=mean(sub.tmp[, "foo"]))
                  })
})
res6 <- t(res6)
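The two sapply() variants above scan the full id column once per id. As a further built-in option, here is a minimal sketch using tapply(), which groups in a single pass. The name res_tapply is introduced here for illustration only, and no timings are claimed for it; the cross-check only confirms that it agrees with the aggregate() result.

```r
# Same data as in the example above.
set.seed(1)
tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
tmp <- tmp[sample(nrow(tmp)), ]   # re-sampling the dataset

# Group means in one pass, with no per-id subsetting.
m <- tapply(tmp$foo, tmp$id, mean)
res_tapply <- data.frame(id = as.integer(names(m)), mean = as.vector(m))

# Cross-check against aggregate(), which names its value column "x".
res4 <- aggregate(tmp$foo, by = list(id = tmp$id), FUN = mean)
stopifnot(isTRUE(all.equal(res_tapply$mean, res4$x)))
```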

Yours
Constantin

--
^
|                X
|               /eiser, Dr. Constantin (weiserc at hhu.de)
|              /Chair of Statistics and Econometrics
|             / Heinrich Heine-University of Düsseldorf
| *    /\    /  Universitätsstraße 1, 40225 Düsseldorf, Germany
|  \  /  \  /   Oeconomicum (Building 24.31), Room 01.22
|   \/    \/    Tel: 0049 211 81-15307
+----------------------------------------------------------->

On 28.09.2016 at 18:28, Doran, Harold wrote:
> Thank you very much. I don’t know tidyverse, I’ll look at that now. I 
> did some tests with data.table package, but it was much slower on my 
> machine, see examples below
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
>
> library(data.table)
>
> tmp2 <- as.data.table(tmp)     # data.table
>
> system.time(replicate(500, tmp2[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp2, id == idList[1])))
>
> From: Dominik Schneider [mailto:dosc3612 at colorado.edu]
> Sent: Wednesday, September 28, 2016 12:27 PM
> To: Doran, Harold <HDoran at air.org>
> Cc: r-help at r-project.org
> Subject: Re: [R] Faster Subsetting
>
> I regularly crunch through this amount of data with tidyverse. You can also try the data.table package. They are optimized for speed, as long as you have the memory.
> Dominik
>
> On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <HDoran at air.org> wrote:
> I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp below in the reproducible code. In my real data, the variable 'id' may or may not be ordered, but I think that is irrelevant.
>
> I have a process that requires subsetting the data by id and then running each smaller data frame through a set of functions. One example below uses indexing and the other uses an explicit call to subset(). Both return the same result, but indexing is faster.
>
> The problem is that in my real data, indexing must scan millions of rows to evaluate the condition, which is expensive and a bottleneck in my code. Can anyone recommend an approach that would be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
> ______________________________________________
> R-help at r-project.org<mailto:R-help at r-project.org> mailing list -- To 
> UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>