[R] slow computation of functions over large datasets

jim holtman jholtman at gmail.com
Wed Aug 3 18:20:06 CEST 2011


Using data.table, this takes under 2 seconds for 1M rows:

> n <- 1000000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE), itemPrice = rpois(n, 10))
> require(data.table)
> # convert to data.table
> ed.dt <- data.table(exampledata)
> system.time(result <- ed.dt[
+                         , list(total = sum(itemPrice))
+                         , by = orderID
+                         ]
+            )
   user  system elapsed
   1.30    0.05    1.34
>
> str(result)
Classes ‘data.table’ and 'data.frame':  198708 obs. of  2 variables:
 $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
> head(result)
     orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76
>
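
Note that the result above is one total per order, whereas the loop in the original question builds a running total row by row within each order. Both can be had without a loop in base R. What follows is only a minimal sketch on the small example from the question; it assumes the rows of each order are adjacent (as they are in the posted data), since ave() groups by orderID value rather than by consecutive runs. The name "totals" is just illustrative.

```r
# Small example data taken from the original question.
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# Running total within each order (what the for/ifelse loop computes),
# assuming all rows of an order sit next to each other.
exampledata$orderAmount <- ave(exampledata$itemPrice,
                               exampledata$orderID,
                               FUN = cumsum)

# One total per order (what the data.table call computes).
# rowsum() returns a one-column matrix with the orderID values as row names.
totals <- rowsum(exampledata$itemPrice, exampledata$orderID)
```

Whether this is fast enough on 1M rows with roughly 200K distinct orders is something to time on the real data; the data.table approach above is probably the safer bet for speed.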


On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
<caroline.faisst at gmail.com> wrote:
> Hello there,
>
>
> I’m computing the total value of each order from the prices of its items
> using a “for” loop and the “ifelse” function. I do this on a large data frame
> (close to 1 million rows). The computation is painfully slow: in
> 1 minute only about 90 rows are processed.
>
>
> The time taken to process a given number of rows increases with the
> size of the dataset; see the example with my code below:
>
>
> # small dataset: function performs well
>
> exampledata<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4),itemPrice=c(10,17,9,12,25,10,1,9,7))
>
> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>
> system.time(for (i in 2:nrow(exampledata)) {
>   exampledata[i, "orderAmount"] <- ifelse(
>     exampledata[i, "orderID"] == exampledata[i - 1, "orderID"],
>     exampledata[i - 1, "orderAmount"] + exampledata[i, "itemPrice"],
>     exampledata[i, "itemPrice"])
> })
>
>
> # large dataset: the very same computational task takes much longer
>
> exampledata2<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>
> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>
> system.time(for (i in 2:9) {
>   exampledata2[i, "orderAmount"] <- ifelse(
>     exampledata2[i, "orderID"] == exampledata2[i - 1, "orderID"],
>     exampledata2[i - 1, "orderAmount"] + exampledata2[i, "itemPrice"],
>     exampledata2[i, "itemPrice"])
> })
>
>
>
> Does someone know a way to increase the speed?
>
>
> Thank you very much!
>
> Caroline



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


