[R] slow computation of functions over large datasets

David Winsemius dwinsemius at comcast.net
Wed Aug 3 16:26:36 CEST 2011


On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:

> Hello there,
>
>
> I’m computing the total value of an order from the price of the order items
> using a “for” loop and the “ifelse” function.

Ouch. Schools really should stop teaching SAS and BASIC as a first  
language.

> I do this on a large data frame (close to 1 million rows). The computation is
> painfully slow: only about 90 rows are processed per minute.
>
>
> The computation time taken for a given number of rows increases with the
> size of the dataset; see the example with my function below:
>
>
> # small dataset: function performs well
>
> exampledata <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4),
>                           itemPrice=c(10,17,9,12,25,10,1,9,7))
>
> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>
> system.time(for (i in 2:length(exampledata[,1]))
> {exampledata[i,"orderAmount"] <-
>    ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],
>           exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],
>           exampledata[i,"itemPrice"])})

Try instead using 'ave' to calculate a cumulative 'sum' within  
"orderID":

exampledata$orderAmt <- with(exampledata,  ave(itemPrice, orderID,  
FUN=cumsum) )

I assure you this will be more reproducible, faster, and easier to
understand.
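
A minimal sketch of what that call does on the small example from the question, assuming only base R. 'ave' splits itemPrice by orderID, applies 'cumsum' within each group, and returns the results in the original row order:

```r
# Small example data frame from the question
exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))

# Running total of itemPrice, restarted for each orderID
exampledata$orderAmt <- with(exampledata,
                             ave(itemPrice, orderID, FUN = cumsum))

# Values are 10 27 36 12 37 10 11 20 7
print(exampledata$orderAmt)
```

Note that 'ave' groups by the value of orderID wherever it occurs, whereas the loop only compares each row with the one before it. The two agree when the rows of each order are adjacent, as they are here.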

> # large dataset:

"medium" dataset really. Barely nudges the RAM dial on my machine.

> the very same computational task takes much longer
>
> exampledata2 <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),
>                            itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>
> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>
> system.time(for (i in 2:9)
> {exampledata2[i,"orderAmount"] <-
>    ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],
>           exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],
>           exampledata2[i,"itemPrice"])})
>
>
 > system.time( exampledata2$orderAmt <- with(exampledata2,   
ave(itemPrice, orderID, FUN=cumsum) ) )
    user  system elapsed
  35.106   0.811  35.822

On a three-year-old machine. Not as fast as I expected, but not long
enough to require refilling the coffee cup either.
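
If the 'ave' timing is still too slow, one possible alternative is a fully vectorized running total. This is only a sketch and not something timed here: 'run_cumsum' is a hypothetical helper name, and it assumes the rows of each order are adjacent (the same assumption the original loop makes). It takes one overall 'cumsum' and subtracts the total accumulated before the start of each run of equal orderID values:

```r
# Hypothetical helper: cumulative sum of 'value' that restarts whenever
# 'id' differs from the previous row (assumes each order's rows are adjacent).
run_cumsum <- function(value, id) {
  n <- length(value)
  if (n == 0L) return(value)
  # TRUE on the first row of each run of identical ids
  starts <- c(TRUE, id[-1L] != id[-n])
  # Cumulative sum over the whole column
  total <- cumsum(value)
  # Total accumulated before each run began
  offset <- (total - value)[starts]
  # cumsum(starts) gives the run number of every row
  total - offset[cumsum(starts)]
}

exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata$orderAmt <- with(exampledata, run_cumsum(itemPrice, orderID))
```

On the small example this gives the same column as the 'ave' call. With very large or non-integer prices the subtraction could introduce floating-point rounding that per-group 'cumsum' would not, so treat it as a trade-off rather than a drop-in replacement.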

-- 
David.
>
> Does someone know a way to increase the speed?
>

-- 

David Winsemius, MD
West Hartford, CT


