[R] slow computation of functions over large datasets
David Winsemius
dwinsemius at comcast.net
Wed Aug 3 16:26:36 CEST 2011
On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:
> Hello there,
>
>
> I'm computing the total value of an order from the price of the order
> items using a for loop and the ifelse function.
Ouch. Schools really should stop teaching SAS and BASIC as a first
language.
> I do this on a large dataframe (close to 1m lines). The computation of
> this function is painfully slow: in 1min only about 90 rows are
> calculated.
>
>
> The computation time taken for a given number of rows increases with
> the size of the dataset, see the example with my function below:
>
>
> # small dataset: function performs well
>
> exampledata <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4),
>                           itemPrice=c(10,17,9,12,25,10,1,9,7))
>
> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>
> system.time(for (i in 2:length(exampledata[,1]))
> {exampledata[i,"orderAmount"]<-
>   ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],
>          exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],
>          exampledata[i,"itemPrice"])})
Try instead using 'ave' to calculate a cumulative 'sum' within "orderID":

exampledata$orderAmt <- with(exampledata,
                             ave(itemPrice, orderID, FUN=cumsum) )

I assure you this will be faster, more reproducible, and easier to
understand.
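
To make the suggestion concrete, here is a small sketch on the 9-row example from the question. It only uses base R; the column name orderAmt is the one used in this reply, not the poster's orderAmount.

```r
# The small example data frame from the original question.
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# ave() splits itemPrice by orderID, applies cumsum() to each piece,
# and returns the results in the original row order.
exampledata$orderAmt <- with(exampledata,
                             ave(itemPrice, orderID, FUN = cumsum))

print(exampledata$orderAmt)
# [1] 10 27 36 12 37 10 11 20  7
```

The running total restarts at every new orderID, which is what the loop with ifelse was computing one row at a time.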
> # large dataset:
"medium" dataset really. Barely nudges the RAM dial on my machine.
> the very same computational task takes much longer
>
> exampledata2 <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),
>                            itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>
> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>
> system.time(for (i in 2:9)
> {exampledata2[i,"orderAmount"]<-
>   ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],
>          exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],
>          exampledata2[i,"itemPrice"])})
>
>
> system.time( exampledata2$orderAmt <- with(exampledata2,
      ave(itemPrice, orderID, FUN=cumsum) ) )
   user  system elapsed
 35.106   0.811  35.822
On a three-year-old machine. Not as fast as I expected, but not long
enough to require refilling the coffee cup either.
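
If 35 seconds is still too long, one possible alternative is a fully vectorized grouped cumulative sum. This is only a sketch and not the 'ave' approach above: the helper name grouped_cumsum is made up for illustration, and it assumes the rows are already ordered so that each orderID forms one contiguous run (true for the example data, not necessarily for real data). The idea is to take one cumsum over the whole column and subtract the running total that had accumulated before each order started.

```r
# Hypothetical helper (not part of base R): cumulative sum of x that
# restarts whenever g changes. Assumes equal values of g are contiguous.
grouped_cumsum <- function(x, g) {
  n <- length(x)
  if (n == 0L) return(numeric(0))
  total  <- cumsum(as.numeric(x))       # as.numeric avoids integer overflow
  starts <- c(TRUE, g[-1] != g[-n])     # TRUE on the first row of each run
  offset <- c(0, total[-n])[starts]     # running total just before each run
  total - offset[cumsum(starts)]        # subtract each run's starting offset
}

exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)
exampledata$orderAmt2 <- with(exampledata, grouped_cumsum(itemPrice, orderID))

print(exampledata$orderAmt2)
# [1] 10 27 36 12 37 10 11 20  7
```

This avoids splitting the data into one piece per order, which is likely where 'ave' spends its time when nearly every row is its own order. With non-integer prices the subtraction can introduce small floating-point differences, so compare against the 'ave' result with all.equal rather than identical.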
--
David.
>
> Does someone know a way to increase the speed?
>
--
David Winsemius, MD
West Hartford, CT