[R] Faster Subsetting

Martin Morgan martin.morgan at roswellpark.org
Wed Sep 28 22:37:29 CEST 2016


On 09/28/2016 02:53 PM, Hervé Pagès wrote:
> Hi,
>
> I'm surprised nobody suggested split(). Splitting the data.frame
> upfront is faster than repeatedly subsetting it:
>
>   tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
>   idList <- unique(tmp$id)
>
>   system.time(for (i in idList) tmp[which(tmp$id == i),])
>   #   user  system elapsed
>   # 16.286   0.000  16.305
>
>   system.time(split(tmp, tmp$id))
>   #   user  system elapsed
>   #  5.637   0.004   5.647

An odd speed-up is to give the data frame explicit (non-sequential) row names, e.g.,

 > system.time(split(tmp, tmp$id))
    user  system elapsed
   4.472   0.648   5.122
 > row.names(tmp) = rev(seq_len(nrow(tmp)))
 > system.time(split(tmp, tmp$id))
    user  system elapsed
   0.588   0.000   0.587

for reasons explained here:

http://stackoverflow.com/questions/39545400/why-is-split-inefficient-on-large-data-frames-with-many-groups/39548316#39548316
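
A smaller, self-contained sketch of the same comparison, in case it is useful for trying this on other data. The object names (tmp2, s_auto, s_named, group_means) are made up for illustration, and the timings are not shown because they depend on the machine and the R version; the point is only that the two splits hold the same values and differ in their row names.

```r
set.seed(1)
tmp <- data.frame(id = rep(1:2000, each = 10), foo = rnorm(20000))

# Baseline: data frame with automatic (sequential) row names
t_auto <- system.time(s_auto <- split(tmp, tmp$id))

# Same data, but with explicit, non-sequential row names
tmp2 <- tmp
row.names(tmp2) <- rev(seq_len(nrow(tmp2)))
t_named <- system.time(s_named <- split(tmp2, tmp2$id))

# Check that each piece carries the same id and foo values;
# only the row names of the pieces should differ
same_values <- all(mapply(
    function(a, b) identical(a$id, b$id) && identical(a$foo, b$foo),
    s_auto, s_named
))

# Each piece can then go through whatever per-id function is needed,
# here just a mean as a stand-in
group_means <- vapply(s_named, function(d) mean(d$foo), numeric(1))

print(rbind(automatic = t_auto, explicit = t_named)[, 1:3])
```

Splitting once and then looping over the pieces with lapply() or vapply() avoids re-scanning the id column for every id, which is what the repeated tmp[which(tmp$id == i), ] approach does.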

Martin


>
> Cheers,
> H.
>
> On 09/28/2016 09:09 AM, Doran, Harold wrote:
>> I have an extremely large data frame (~13 million rows) that resembles
>> the structure of the object tmp below in the reproducible code. In my
>> real data, the variable, 'id' may or may not be ordered, but I think
>> that is irrelevant.
>>
>> I have a process that requires subsetting the data by id and then
>> running each smaller data frame through a set of functions. One
>> example below uses indexing and the other uses an explicit call to
>> subset(), both return the same result, but indexing is faster.
>>
>> Problem is in my real data, indexing must parse through millions of
>> rows to evaluate the condition and this is expensive and a bottleneck
>> in my code.  I'm curious if anyone can recommend an improvement that
>> would somehow be less expensive and faster?
>>
>> Thank you
>> Harold
>>
>>
>> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>>
>> idList <- unique(tmp$id)
>>
>> ### Fast, but not fast enough
>> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>>
>> ### Not fast at all, a big bottleneck
>> system.time(replicate(500, subset(tmp, id == idList[1])))
>>
>

