[R] Fastest way to calculate quantile in large data.table

Steve Lianoglou lianoglou.steve at gene.com
Thu Feb 5 21:16:05 CET 2015


Not sure if there is a question in here somewhere?

But if I can point out an observation: if you are doing summary
calculations across the rows like this, my guess is that using a
data.table (data.frame) structure for that will really bite you,
because this operation on a data.table/data.frame is expensive;

  x <- dt[i,]

However it's much faster with a matrix. It doesn't seem like you're
doing anything with this dataset that takes advantage of data.table's
quick grouping/indexing mojo, so why store it in  data.table at all?

Witness:

R> library(data.table)
R> m <- matrix(rnorm(1e6), nrow=10)
R> d <- as.data.table(m)
R> idxs <- sample(1:nrow(m), 500, replace=TRUE)

R> system.time(for (i in idxs) x <- m[i,])
   user  system elapsed
  0.497   0.169   0.670

R> system.time(for (i in idxs) x <- d[i,])
## I killed it after waiting for 14 seconds

-steve

On Thu, Feb 5, 2015 at 11:48 AM, Camilo Mora <cmora at dal.ca> wrote:
> In total I found 8 different way to calculate quantile in very a large data.table. I share below their performances for future reference. Tests 1, 7 and 8 were the fastest I found.
>
> Best,
>
> Camilo
>
> library(data.table)
> v <- data.table(x=runif(10000),x2 = runif(10000),  x3=runif(10000),x4=runif(10000))
>
> #fastest
> Sys.time()->StartTEST1
> t(v[, apply(v,1,quantile,probs =c(.1,.9,.5),na.rm=TRUE)] )
> Sys.time()->EndTEST1
>
> Sys.time()->StartTEST2
> v[, quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)]
> Sys.time()->EndTEST2
>
> Sys.time()->StartTEST3
> v[, c("L","H","M"):=quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)]
> Sys.time()->EndTEST3
> v
> v[, c("L","H","M"):=NULL]
>
> v[,Names:=rownames(v)]
> setkey(v,Names)
>
> Sys.time()->StartTEST4
> v[, c("L","H","M"):=quantile(.SD,probs =c(.1,.9,.5)), by = Names]
> Sys.time()->EndTEST4
> v
> v[, c("L","H","M"):=NULL]
>
>
> Sys.time()->StartTEST5
> v[,  as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names]
> Sys.time()->EndTEST5
>
>
> Sys.time()->StartTEST6
> v[,  as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names,.SDcols=1:4]
> Sys.time()->EndTEST6
>
>
> Sys.time()->StartTEST7
> v[,  as.list(quantile(c(x ,       x2,        x3,        x4 ),c(.1,.90,.5),na.rm=TRUE)), by=Names]
> Sys.time()->EndTEST7
>
>
> # melting the database and doing quantily by summary. This is the second fastest, which is ironic given that the database has to be melted first
> library(reshape2)
> Sys.time()->StartTEST8
> vs<-melt(v)
> vs[,  as.list(quantile(value,c(.1,.90,.5),na.rm=TRUE)), by=Names]
> Sys.time()->EndTEST8
>
>
> EndTEST1-StartTEST1
> EndTEST2-StartTEST2
> EndTEST3-StartTEST3
> EndTEST4-StartTEST4
> EndTEST5-StartTEST5
> EndTEST6-StartTEST6
> EndTEST7-StartTEST7
> EndTEST8-StartTEST8
>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Computational Biologist
Genentech



More information about the R-help mailing list