[R] slow computation of functions over large datasets
Ken
vicvoncastle at gmail.com
Wed Aug 3 20:01:38 CEST 2011
Hello,
Perhaps transpose the table attach(as.data.frame(t(data))) and use ColSums() function with order id as header.
-Ken Hutchison
On Aug 3, 2554 BE, at 1:12 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Aug 3, 2011, at 12:20 PM, jim holtman wrote:
>
>> This takes about 2 secs for 1M rows:
>>
>>> n <- 1000000
>>> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE), itemPrice = rpois(n, 10))
>>> require(data.table)
>>> # convert to data.table
>>> ed.dt <- data.table(exampledata)
>>> system.time(result <- ed.dt[
>> + , list(total = sum(itemPrice))
>> + , by = orderID
>> + ]
>> + )
>> user system elapsed
>> 1.30 0.05 1.34
>
> Interesting. Impressive. And I noted that the OP wanted what cumsum would provide and for some reason creating that longer result is even faster on my machine than the shorter result using sum.
>
> --
> David.
>>>
>>> str(result)
>> Classes ‘data.table’ and 'data.frame': 198708 obs. of 2 variables:
>> $ orderID: int 1 2 3 4 5 6 8 9 10 11 ...
>> $ total : num 49 37 72 92 50 76 34 22 65 39 ...
>>> head(result)
>> orderID total
>> [1,] 1 49
>> [2,] 2 37
>> [3,] 3 72
>> [4,] 4 92
>> [5,] 5 50
>> [6,] 6 76
>>>
>>
>>
>> On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
>> <caroline.faisst at gmail.com> wrote:
>>> Hello there,
>>>
>>>
>>> I’m computing the total value of an order from the price of the order items
>>> using a “for” loop and the “ifelse” function. I do this on a large dataframe
>>> (close to 1m lines). The computation of this function is painfully slow: in
>>> 1min only about 90 rows are calculated.
>>>
>>>
>>> The computation time taken for a given number of rows increases with the
>>> size of the dataset, see the example with my function below:
>>>
>>>
>>> # small dataset: function performs well
>>>
>>> exampledata<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4),itemPrice=c(10,17,9,12,25,10,1,9,7))
>>>
>>> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>>>
>>> system.time(for (i in 2:length(exampledata[,1]))
>>> {exampledata[i,"orderAmount"]<-ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],exampledata[i,"itemPrice"])})
>>>
>>>
>>> # large dataset: the very same computational task takes much longer
>>>
>>> exampledata2<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>>>
>>> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>>>
>>> system.time(for (i in 2:9)
>>> {exampledata2[i,"orderAmount"]<-ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],exampledata2[i,"itemPrice"])})
>>>
>>>
>>>
>>> Does someone know a way to increase the speed?
>>>
>>>
>>> Thank you very much!
>>>
>>> Caroline
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list