[R] Faster Subsetting
Doran, Harold
HDoran at air.org
Wed Sep 28 19:07:36 CEST 2016
Many thanks. I also tried the filter function in dplyr, and it was also much slower than simply indexing the way the original code did.
system.time(replicate(500, filter(tmp, id == idList[1])))
I did this on the toy example as well as the real data, finding the same (slower) result each time compared to the indexing method.
Perhaps I'm using it incorrectly?
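One possible explanation, offered as a sketch rather than a diagnosis: filter(), subset() and which() each compare the whole id column against one value on every call, so repeating that for every id rescans the data each time. Assuming the toy tmp from this thread, base R's split() partitions the data once, and each group is then retrieved by name. The names pieces, first_group and results are made up for illustration. On the real 13 million rows, split() copies the data, so memory may become the limit.

```r
# Toy data with the same shape as the example in this thread
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

# Partition once: a single pass over the data instead of one scan per id
pieces <- split(tmp, tmp$id)

# Retrieving a group is now a list lookup by name, not a full-column comparison
first_group <- pieces[[as.character(idList[1])]]

# The indexing approach from the thread, for comparison
by_index <- tmp[which(tmp$id == idList[1]), ]
identical(first_group$foo, by_index$foo)

# Run each smaller data frame through a function (mean stands in for the real analysis)
results <- lapply(pieces, function(d) mean(d$foo))
```

The list returned by split() is named by id, so results[["17"]] would hold the value for id 17.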
-----Original Message-----
From: Constantin Weiser [mailto:constantin.weiser at hhu.de]
Sent: Wednesday, September 28, 2016 12:55 PM
To: r-help at r-project.org
Cc: Doran, Harold <HDoran at air.org>
Subject: Re: [R] Faster Subsetting
I just modified the reproducible example a bit, so it's a bit more realistic. The function "mean" could be "easily" replaced by your analysis.
And here are some possible solutions:
tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
tmp <- tmp[sample(dim(tmp)[1]),] # re-sampling the dataset
## with specialized packages
require(plyr)
system.time({
res1 <- ddply(tmp, .(id), summarize, mean=mean(foo))
})
require(dplyr)
system.time({
res2 <- tmp %>%
group_by(id) %>%
summarise(mean = mean(foo))
})
library(data.table)
system.time({
res3 <- data.table(tmp)[, list(mean=mean(foo)), by=id]
})
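If single-id lookups are still needed, data.table may do better when it is allowed to use a key. The timing test quoted further down calls which() on the data.table, which is the same full-column scan as before. A minimal sketch, assuming the data.table package is installed and using the toy data from the thread (the names dt, one_id and group_means are illustrative):

```r
library(data.table)

set.seed(1)
dt <- data.table(id = rep(1:200, each = 10), foo = rnorm(2000))

# Sort by id and mark it as the key; later lookups on id can use binary search
setkey(dt, id)

# Keyed subset: .() supplies the value(s) to match against the key column
one_id <- dt[.(1L)]

# Per-group computation without subsetting by hand
group_means <- dt[, .(mean = mean(foo)), by = id]
```

Whether this is faster on the real data would have to be measured there; setkey() has a one-time sorting cost.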
## built-in R methods
system.time({
res4 <- aggregate(tmp$foo, by = list(id=tmp$id), FUN = mean)
})
system.time({
res5 <- sapply(unique(tmp$id), simplify = TRUE,
FUN = function(x){
c(id=x, mean=mean(tmp[which(tmp$id == x), "foo"]))
})
})
res5 <- t(res5)
system.time({
res6 <- sapply(unique(tmp$id), simplify = TRUE,
FUN = function(x){
sub.tmp <- subset(tmp, id == x)
c(id=x, mean=mean(sub.tmp[, "foo"]))
})
})
res6 <- t(res6)
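The approaches return their results in different shapes and orders: aggregate() sorts by id, while the sapply() loops follow the order of unique(tmp$id). Before comparing timings it may be worth checking that they agree. A sketch using base R only, on a smaller data set than above so that it runs quickly (the names res_agg, res_tap and res_loop are illustrative):

```r
set.seed(42)
tmp <- data.frame(id = rep(1:200, each = 100), foo = rnorm(20000))
tmp <- tmp[sample(nrow(tmp)), ]  # shuffle the rows, as in the example above

# aggregate() returns a data frame with columns id and x, sorted by id
res_agg <- aggregate(tmp$foo, by = list(id = tmp$id), FUN = mean)

# tapply() returns a vector named by id, also in sorted order
res_tap <- tapply(tmp$foo, tmp$id, mean)

# A per-id loop follows the order of unique(tmp$id), so reorder before comparing
ids <- unique(tmp$id)
res_loop <- vapply(ids, function(x) mean(tmp$foo[tmp$id == x]), numeric(1))
res_loop <- res_loop[order(ids)]

all.equal(res_agg$x, as.vector(res_tap))
all.equal(res_agg$x, res_loop)
```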
Yours
Constantin
--
^
| X
| /eiser, Dr. Constantin (weiserc at hhu.de)
| /Chair of Statistics and Econometrics
| / Heinrich Heine-University of Düsseldorf
| * /\ / Universitätsstraße 1, 40225 Düsseldorf, Germany
| \ / \ / Oeconomicum (Building 24.31), Room 01.22
| \/ \/ Tel: 0049 211 81-15307
+----------------------------------------------------------->
Am 28.09.2016 um 18:28 schrieb Doran, Harold:
> Thank you very much. I don’t know tidyverse, I’ll look at that now. I
> did some tests with data.table package, but it was much slower on my
> machine, see examples below
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
>
> library(data.table)
>
> tmp2 <- as.data.table(tmp) # data.table
>
> system.time(replicate(500, tmp2[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp2, id == idList[1])))
>
> From: Dominik Schneider [mailto:dosc3612 at colorado.edu]
> Sent: Wednesday, September 28, 2016 12:27 PM
> To: Doran, Harold <HDoran at air.org>
> Cc: r-help at r-project.org
> Subject: Re: [R] Faster Subsetting
>
> I regularly crunch through this amount of data with tidyverse. You can also try the data.table package. They are optimized for speed, as long as you have the memory.
> Dominik
>
> On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <HDoran at air.org> wrote:
> I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp below in the reproducible code. In my real data, the variable, 'id' may or may not be ordered, but I think that is irrelevant.
>
> I have a process that requires subsetting the data by id and then running each smaller data frame through a set of functions. One example below uses indexing and the other uses an explicit call to subset(), both return the same result, but indexing is faster.
>
> Problem is in my real data, indexing must parse through millions of rows to evaluate the condition and this is expensive and a bottleneck in my code. I'm curious if anyone can recommend an improvement that would somehow be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
> ______________________________________________
> R-help at r-project.org<mailto:R-help at r-project.org> mailing list -- To
> UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.