[R] slow computation of functions over large datasets

David Winsemius dwinsemius at comcast.net
Wed Aug 3 17:10:25 CEST 2011


On Aug 3, 2011, at 9:59 AM, ONKELINX, Thierry wrote:

> Dear Caroline,
>
> Here is a faster and more elegant solution.
>
>> n <- 10000
>> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
>>                           itemPrice = rpois(n, 10))
>> library(plyr)
>> system.time({
> + 	ddply(exampledata, .(orderID), function(x){
> + 		data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
> + 	})
> + })
>   user  system elapsed
>   1.67    0.00    1.69
>> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>> system.time(for (i in 2:length(exampledata[,1]))
> + {exampledata[i, "orderAmount"] <-
> +    ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
> +           exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
> +           exampledata[i, "itemPrice"])})
>   user  system elapsed
>  11.94    0.02   11.97

I tried running this method on the "large dataset" (2MM rows) the OP  
offered, and eventually had to interrupt it to get my console back:

 > system.time({
+  	ddply(exampledata2, .(orderID), function(x){
+  		data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+  	})
+  })

Timing stopped at: 808.473 1013.749 1816.125

The same task with ave() took 35 seconds.
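For reference, here is a minimal sketch of the ave() approach on the OP's
small example; the exact call used for that 35-second timing is my
assumption, but the idea is that ave() with FUN = cumsum computes the
running total within each orderID group without any explicit loop:

```r
## A minimal sketch of the ave() approach, using the OP's small example
## data; the exact call behind the timing quoted above is an assumption.
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))

## ave() splits itemPrice by orderID, applies cumsum() within each group,
## and returns the results in the original row order -- no row-by-row loop.
exampledata$orderAmount <- ave(exampledata$itemPrice, exampledata$orderID,
                               FUN = cumsum)
```

One caveat: ave() groups by orderID value wherever it occurs, while the
OP's loop only accumulates over consecutive rows with the same orderID.
The two agree when the data are sorted by orderID, as in the examples here.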

-- 
david.

>
> Best regards,
>
> Thierry
>> -----Original message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On behalf of Caroline Faisst
>> Sent: Wednesday, 3 August 2011 15:26
>> To: r-help at r-project.org
>> Subject: [R] slow computation of functions over large datasets
>>
>> Hello there,
>>
>>
>> I'm computing the total value of an order from the prices of the
>> order items using a "for" loop and the "ifelse" function. I do this
>> on a large dataframe (close to 1m lines). The computation is
>> painfully slow: in 1 min only about 90 rows are calculated.
>>
>>
>> The computation time for a given number of rows increases with the
>> size of the dataset; see the example with my function below:
>>
>>
>> # small dataset: function performs well
>>
>> exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
>>                           itemPrice = c(10,17,9,12,25,10,1,9,7))
>>
>> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>>
>> system.time(for (i in 2:length(exampledata[,1]))
>> {exampledata[i, "orderAmount"] <-
>>    ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
>>           exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
>>           exampledata[i, "itemPrice"])})
>>
>> # large dataset: the very same computational task takes much longer
>>
>> exampledata2 <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4,5:2000000),
>>                            itemPrice = c(10,17,9,12,25,10,1,9,7,25:2000020))
>>
>> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>>
>> system.time(for (i in 2:9)
>> {exampledata2[i, "orderAmount"] <-
>>    ifelse(exampledata2[i, "orderID"] == exampledata2[i-1, "orderID"],
>>           exampledata2[i-1, "orderAmount"] + exampledata2[i, "itemPrice"],
>>           exampledata2[i, "itemPrice"])})
>>
>> Does someone know a way to increase the speed?
>>
>>
>> Thank you very much!
>>
>> Caroline

David Winsemius, MD
West Hartford, CT
