[R] Faster Subsetting

Hervé Pagès hpages at fredhutch.org
Wed Sep 28 20:53:01 CEST 2016


Hi,

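I'm surprised nobody suggested split(). Splitting the data.frame
upfront is faster than repeatedly subsetting it, because each
tmp$id == i comparison has to scan the whole id column, while
split() groups all the rows in a single call: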

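   # 20000 ids with 10 rows each
   tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
   idList <- unique(tmp$id)

   # Subset once per id: every iteration scans all 200000 rows
   system.time(for (i in idList) tmp[which(tmp$id == i),])
   #   user  system elapsed
   # 16.286   0.000  16.305

   # Split once into a list with one data.frame per id
   system.time(split(tmp, tmp$id))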
   #   user  system elapsed
   #  5.637   0.004   5.647
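
Once the data.frame is split, each piece can be passed to whatever
per-id processing is needed. A minimal sketch on a smaller version of
the data, assuming a made-up per-group function summarize_piece() as a
stand-in for the real functions (the names pieces, summarize_piece and
res are mine, not from the original post):

```r
# Smaller version of the example data so the sketch runs quickly.
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# Split once; the result is a named list with one data.frame per id.
pieces <- split(tmp, tmp$id)

# Hypothetical per-group computation (placeholder for the real functions).
summarize_piece <- function(d) mean(d$foo)

# Apply it to every piece; names(res) are the ids as character strings.
res <- vapply(pieces, summarize_piece, numeric(1))

# Each piece holds the same rows as the corresponding which()-based subset.
stopifnot(isTRUE(all.equal(pieces[["1"]], tmp[which(tmp$id == 1), ])))
```

Note that the list is indexed by the id as a character string, e.g.
pieces[["1"]], so the id column does not have to be sorted.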

Cheers,
H.

On 09/28/2016 09:09 AM, Doran, Harold wrote:
> I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp in the reproducible code below. In my real data, the variable 'id' may or may not be ordered, but I think that is irrelevant.
>
> I have a process that requires subsetting the data by id and then running each smaller data frame through a set of functions. One example below uses indexing and the other uses an explicit call to subset(); both return the same result, but indexing is faster.
>
> The problem is that in my real data, indexing must scan millions of rows to evaluate the condition, which is expensive and a bottleneck in my code. Can anyone recommend an approach that would be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
>

-- 
Hervé Pagès



