[R] Faster Subsetting
Martin Morgan
martin.morgan at roswellpark.org
Wed Sep 28 22:37:29 CEST 2016
On 09/28/2016 02:53 PM, Hervé Pagès wrote:
> Hi,
>
> I'm surprised nobody suggested split(). Splitting the data.frame
> upfront is faster than repeatedly subsetting it:
>
> tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
> idList <- unique(tmp$id)
>
> system.time(for (i in idList) tmp[which(tmp$id == i),])
> #   user  system elapsed
> # 16.286   0.000  16.305
>
> system.time(split(tmp, tmp$id))
> #   user  system elapsed
> #  5.637   0.004   5.647
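For the original use case (running each per-id data frame through a set of functions), the split-upfront idea quoted above might look roughly like the sketch below. Note that summarise_group() is a hypothetical stand-in for the real per-group processing, which the thread does not show, and the data are the small example from the original question.

```r
# Hypothetical per-group function; an assumption standing in for the
# poster's real "set of functions".
summarise_group <- function(d) c(n = nrow(d), mean_foo = mean(d$foo))

set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# Split once up front, then process every piece.
pieces <- split(tmp, tmp$id)
res_split <- vapply(pieces, summarise_group, numeric(2))

# Repeated subsetting, for comparison: tmp$id is scanned once per id.
idList <- unique(tmp$id)
res_loop <- vapply(
  idList,
  function(i) summarise_group(tmp[which(tmp$id == i), ]),
  numeric(2)
)

# Both approaches give the same numbers. Here the ids are already sorted,
# so split()'s level order matches the order returned by unique().
same <- isTRUE(all.equal(unname(res_split), unname(res_loop)))
print(same)
```

If the ids are not sorted, split() returns the groups in sorted level order while unique() keeps order of first appearance, so results should be matched by name rather than by position.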
An odd speed-up is to provide (non-sequential) row names, e.g.,
> system.time(split(tmp, tmp$id))
   user  system elapsed
  4.472   0.648   5.122
> row.names(tmp) = rev(seq_len(nrow(tmp)))
> system.time(split(tmp, tmp$id))
   user  system elapsed
  0.588   0.000   0.587
for reasons explained here
http://stackoverflow.com/questions/39545400/why-is-split-inefficient-on-large-data-frames-with-many-groups/39548316#39548316
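A self-contained way to try this, and to check that changing the row names does not change the data in each group, could look like the sketch below. The data frame is smaller than the one timed above so that it runs quickly, and the names tmp2, s_auto and s_named are only illustrative. The timings depend on the machine and the R version, so the size of the effect may differ from the numbers shown above.

```r
set.seed(1)
tmp <- data.frame(id = rep(1:2000, each = 10), foo = rnorm(20000))

# Baseline: automatic (sequential) row names.
t_auto <- system.time(s_auto <- split(tmp, tmp$id))

# Same data with explicit, non-sequential row names.
tmp2 <- tmp
row.names(tmp2) <- rev(seq_len(nrow(tmp2)))
t_named <- system.time(s_named <- split(tmp2, tmp2$id))

# Each group holds the same id and foo values; only the row names differ.
same_data <- all(mapply(
  function(a, b) identical(a$id, b$id) && identical(a$foo, b$foo),
  s_auto, s_named
))

print(c(automatic = t_auto[["elapsed"]], explicit = t_named[["elapsed"]]))
print(same_data)
```

Any code that later relies on the row names of the pieces would see the new values, so this is only appropriate when the row names carry no meaning.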
Martin
>
> Cheers,
> H.
>
> On 09/28/2016 09:09 AM, Doran, Harold wrote:
>> I have an extremely large data frame (~13 million rows) that resembles
>> the structure of the object tmp below in the reproducible code. In my
>> real data, the variable 'id' may or may not be ordered, but I think
>> that is irrelevant.
>>
>> I have a process that requires subsetting the data by id and then
>> running each smaller data frame through a set of functions. One
>> example below uses indexing and the other uses an explicit call to
>> subset(); both return the same result, but indexing is faster.
>>
>> The problem is that in my real data, indexing must scan millions of
>> rows to evaluate the condition, which is expensive and a bottleneck
>> in my code. Can anyone recommend an approach that would be less
>> expensive and faster?
>>
>> Thank you
>> Harold
>>
>>
>> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>>
>> idList <- unique(tmp$id)
>>
>> ### Fast, but not fast enough
>> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>>
>> ### Not fast at all, a big bottleneck
>> system.time(replicate(500, subset(tmp, id == idList[1])))
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>