[R] Faster Subsetting
Dénes Tóth
toth.denes at ttk.mta.hu
Thu Sep 29 00:55:32 CEST 2016
Hi Harold,
Generally speaking: you cannot beat data.table unless you can represent
your data as a matrix (or array or vector). For some specific cases,
Hervé's suggestion may also be competitive.
Your problem is that you did not put any effort into reading at least
part of the very extensive documentation of the data.table package. You
should start here: https://github.com/Rdatatable/data.table/wiki/Getting-started
In a nutshell: use a key, which allows binary search instead of the
much slower vector scan. (With the auto-indexing feature of the
data.table package, you may even skip this step.) The point is that the
key has to be created only once, and all subsequent subsetting
operations which use the key become incredibly fast. You missed this
point because, in one of your examples, you replicated the creation of
the key as well, not only the subsetting.
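For illustration, the key-once / subset-many pattern looks like this (a
minimal sketch; the object and column names are made up for this example):

library(data.table)

## pay the sorting cost once: setkey() reorders the table by 'id'
dt <- data.table(id = rep(1:200, each = 10), foo = rnorm(2000))
setkey(dt, id)

## all later subsets on the key column use binary search
dt[.(5)]       # join-style subset on the key column
dt[id == 5]    # auto-indexing can make this form fast as well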
Here is a version of Hervé's example (OK, it is a bit biased because
data.table has a highly optimized internal version of mean() for
calculating group means):
## create a keyed data.table
library(data.table)
tmp_dt <- data.table(id = rep(1:20000, each = 10), foo = rnorm(200000),
                     key = "id")
system.time(tmp_dt[, .(result = mean(foo)), by = id])
# user system elapsed
# 0.004 0.000 0.005
## subset a keyed data.table
all_ids <- tmp_dt[, unique(id)]
select_id <- sample(all_ids, 1)
system.time(tmp_dt[.(select_id)])
# user system elapsed
# 0.000 0.000 0.001
## or equivalently
system.time(tmp_dt[id == select_id])
# user system elapsed
# 0.000 0.000 0.001
Note: the CRAN version of the data.table package is already very fast,
but you should try the development version (
devtools::install_github("Rdatatable/data.table") ) for multi-threaded
subsetting.
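Once it is installed, you can check or adjust the number of threads (a
sketch, assuming a data.table build that exports these helpers):

getDTthreads()    # how many threads data.table will use
setDTthreads(2)   # e.g., restrict it to 2 threads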
Cheers,
Denes
On 09/28/2016 08:53 PM, Hervé Pagès wrote:
> Hi,
>
> I'm surprised nobody suggested split(). Splitting the data.frame
> upfront is faster than repeatedly subsetting it:
>
> tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
> idList <- unique(tmp$id)
>
> system.time(for (i in idList) tmp[which(tmp$id == i),])
> # user system elapsed
> # 16.286 0.000 16.305
>
> system.time(split(tmp, tmp$id))
> # user system elapsed
> # 5.637 0.004 5.647
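For completeness, the pre-split list would then typically be consumed in
a single pass with lapply() (a minimal sketch, reusing the objects above):

## apply the per-id work to each piece instead of re-subsetting
tmp_split <- split(tmp, tmp$id)
res <- lapply(tmp_split, function(d) mean(d$foo))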
>
> Cheers,
> H.
>
> On 09/28/2016 09:09 AM, Doran, Harold wrote:
>> I have an extremely large data frame (~13 million rows) that resembles
>> the structure of the object tmp in the reproducible code below. In my
>> real data, the variable 'id' may or may not be ordered, but I think
>> that is irrelevant.
>>
>> I have a process that requires subsetting the data by id and then
>> running each smaller data frame through a set of functions. One
>> example below uses indexing and the other an explicit call to
>> subset(); both return the same result, but indexing is faster.
>>
>> The problem is that in my real data, indexing must scan through
>> millions of rows to evaluate the condition, which is expensive and a
>> bottleneck in my code. Can anyone recommend an improvement that would
>> be less expensive and faster?
>>
>> Thank you
>> Harold
>>
>>
>> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>>
>> idList <- unique(tmp$id)
>>
>> ### Fast, but not fast enough
>> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>>
>> ### Not fast at all, a big bottleneck
>> system.time(replicate(500, subset(tmp, id == idList[1])))
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>