[R] Faster Subsetting

Dominik Schneider dosc3612 at colorado.edu
Wed Sep 28 18:26:45 CEST 2016


I regularly crunch through this amount of data with the tidyverse. You can
also try the data.table package. Both are optimized for speed, as long as
you have the memory.
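
For example, with data.table you can set a key on 'id' once, which sorts
the table so that each subset becomes a binary search instead of a full
scan of all 13 million rows. A minimal sketch against the tmp object from
your example (dt is just a placeholder name; the actual speedup will
depend on your data):

library(data.table)

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
dt <- as.data.table(tmp)
setkey(dt, id)   # sort once so later subsets by id are binary searches

idList <- unique(dt$id)
system.time(replicate(500, dt[.(idList[1])]))   # keyed subset via .()

And if you are going to process every id anyway, splitting the data frame
once in base R avoids the repeated scans entirely (yourFunction below is a
placeholder for whatever you run on each piece):

byId <- split(tmp, tmp$id)        # one pass to build a list of data frames
results <- lapply(byId, yourFunction)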
Dominik

On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <HDoran at air.org> wrote:

> I have an extremely large data frame (~13 million rows) that resembles the
> structure of the object tmp below in the reproducible code. In my real
> data, the variable 'id' may or may not be ordered, but I think that is
> irrelevant.
>
> I have a process that requires subsetting the data by id and then running
> each smaller data frame through a set of functions. One example below uses
> indexing and the other uses an explicit call to subset(), both return the
> same result, but indexing is faster.
>
> The problem is that in my real data, indexing must scan through millions
> of rows to evaluate the condition, and this is expensive and a bottleneck
> in my code. I'm curious whether anyone can recommend an improvement that
> would somehow be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
