[R] Faster Subsetting
ruipbarradas at sapo.pt
Wed Sep 28 18:57:15 CEST 2016
Hello,
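If you work with a matrix instead of a data.frame, subsetting usually
runs faster, but all of your columns must be numeric. A matrix holds a
single type, so as.matrix() converts everything to character if any
column is not numeric.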
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
   user  system elapsed
   0.05    0.00    0.04
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
   user  system elapsed
   0.07    0.00    0.08
>
# Make it a matrix and use the matrix
> mattmp <- as.matrix(tmp)
> system.time(replicate(500, mattmp[which(mattmp[,"id"] == idList[1]),]))
   user  system elapsed
   0.01    0.00    0.01
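
The caveat about numeric columns can be checked with a small sketch. It reuses the tmp example from the original post; the extra column grp and the objects sub, tmp2 and mixed are hypothetical names added only for illustration, and drop = FALSE is an optional addition that was not part of the timed call above.

```r
# Same shape as tmp in the original post (seed added for reproducibility)
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# All columns are numeric, so as.matrix() gives a numeric matrix
mattmp <- as.matrix(tmp)
is.numeric(mattmp)    # TRUE

# drop = FALSE keeps the result a matrix even when a single row matches
sub <- mattmp[mattmp[, "id"] == 1, , drop = FALSE]
dim(sub)              # 10 2

# Hypothetical character column: the whole matrix is coerced to character
tmp2 <- tmp
tmp2$grp <- "a"
mixed <- as.matrix(tmp2)
is.character(mixed)   # TRUE
```

If the data has non-numeric columns, one option is to convert only the numeric columns to a matrix and keep the rest in the data.frame.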
Hope this helps,
Rui Barradas
Quoting Doran, Harold <HDoran at air.org>:
> I have an extremely large data frame (~13 million rows) that
> resembles the structure of the object tmp below in the reproducible
> code. In my real data, the variable 'id' may or may not be ordered,
> but I think that is irrelevant.
>
> I have a process that requires subsetting the data by id and then
> running each smaller data frame through a set of functions. One
> example below uses indexing and the other uses an explicit call to
> subset(). Both return the same result, but indexing is faster.
>
> The problem is that in my real data, indexing must scan millions of
> rows to evaluate the condition, and this is expensive and a
> bottleneck in my code. Can anyone recommend an approach that would
> be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>