[R] Faster Subsetting

Bert Gunter bgunter.4567 at gmail.com
Wed Sep 28 19:00:54 CEST 2016


Don't do it this way. You are reinventing wheels.

1. Look at package dplyr, which has optimized functions to do exactly
this (break into subframes, calculate on subframes, reassemble). Note
also that dplyr is part of the tidyverse. I use base R functionality
for this because I know it and it does what I need, but dplyr may be
better for your needs.
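For example, the per-group mean discussed further down can be written as a single grouped summary in dplyr. A sketch, assuming dplyr is installed and using the tmp data frame from the example below:

```r
library(dplyr)

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

## group once, then compute one summary row per id in a single pass
result <- tmp %>%
  group_by(id) %>%
  summarise(mean_foo = mean(foo))
```

This avoids rescanning the whole id column once per group, which is the bottleneck in the repeated-subset approach below.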

2. In base R, this would be done by by(): so for your example,

by(tmp, tmp$id, FUN, ...)

where FUN is a function that does whatever you want on each sub-data
frame, e.g. if you wanted just to take the mean of foo for each
subframe:

by(tmp, tmp$id, function(x) mean(x$foo))

## (but there are better ways of doing such a simple function in base
R or dplyr)
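For instance, a simple per-group summary like this can be computed in one pass with tapply() or aggregate(), with no extra packages (a sketch using the example data):

```r
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

## named vector of means, one element per id
means <- tapply(tmp$foo, tmp$id, mean)

## or the same result as a two-column data frame
agg <- aggregate(foo ~ id, data = tmp, FUN = mean)
```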


Cheers,
Bert





Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Sep 28, 2016 at 9:28 AM, Doran, Harold <HDoran at air.org> wrote:
> Thank you very much. I don't know tidyverse; I'll look at that now. I did some tests with the data.table package, but it was much slower on my machine; see the examples below
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
>
> library(data.table)
>
> tmp2 <- as.data.table(tmp)     # data.table
>
> system.time(replicate(500, tmp2[which(tmp2$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp2, id == idList[1])))
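(Editor's note: the timings above drive data.table through which() and subset(), which scan the whole column just as base R does. A sketch of the idiomatic alternative, in which the table is keyed so that subsetting by id uses binary search rather than a full vector scan:)

```r
library(data.table)

tmp  <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
tmp2 <- as.data.table(tmp)

## sort by id once and mark it as the key
setkey(tmp2, id)

idList <- unique(tmp$id)

## keyed join: binary search on id instead of scanning every row
system.time(replicate(500, tmp2[.(idList[1])]))
```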
>
> From: Dominik Schneider [mailto:dosc3612 at colorado.edu]
> Sent: Wednesday, September 28, 2016 12:27 PM
> To: Doran, Harold <HDoran at air.org>
> Cc: r-help at r-project.org
> Subject: Re: [R] Faster Subsetting
>
> I regularly crunch through this amount of data with tidyverse. You can also try the data.table package. They are optimized for speed, as long as you have the memory.
> Dominik
>
> On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <HDoran at air.org> wrote:
> I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp below in the reproducible code. In my real data, the variable, 'id' may or may not be ordered, but I think that is irrelevant.
>
> I have a process that requires subsetting the data by id and then running each smaller data frame through a set of functions. One example below uses indexing and the other an explicit call to subset(); both return the same result, but indexing is faster.
>
> Problem is, in my real data indexing must parse through millions of rows to evaluate the condition, and this is expensive and a bottleneck in my code. I'm curious whether anyone can recommend an improvement that would be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


