[R] slow computation of functions over large datasets

David Winsemius dwinsemius at comcast.net
Wed Aug 3 21:46:09 CEST 2011


On Aug 3, 2011, at 3:05 PM, Ken wrote:

> Sorry about the lack of code, but using David's example, would:
> tapply(itemPrice, INDEX=orderID, FUN=sum)
> work?

That doesn't do the cumulative sums or the assignment into a column of
the same data.frame. That's why I used ave(): it keeps the sequence
correct.
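
A minimal sketch of that difference, using the small example data from
the original post and base R only. The name order_totals is introduced
here purely for illustration; it does not appear elsewhere in the thread.

```r
# Small example data from the original post
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# tapply() collapses the data: one total per orderID,
# so the result is shorter than the data.frame
order_totals <- tapply(exampledata$itemPrice,
                       INDEX = exampledata$orderID,
                       FUN = sum)

# ave() applies cumsum() within each orderID and returns a vector
# with one value per row, in the original row order, so it can be
# assigned straight into a new column
exampledata$orderAmount <- ave(exampledata$itemPrice,
                               exampledata$orderID,
                               FUN = cumsum)
```

Note that ave() groups by orderID wherever the rows sit, whereas the
original loop only compares each row with the one before it. The two
agree when the rows of each order are contiguous, as in this example.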

-- 
David.
>  -Ken Hutchison
>
> On Aug 3, 2554 BE, at 2:09 PM, David Winsemius  
> <dwinsemius at comcast.net> wrote:
>
>>
>> On Aug 3, 2011, at 2:01 PM, Ken wrote:
>>
>>> Hello,
>>> Perhaps transpose the table with attach(as.data.frame(t(data))) and
>>> use the colSums() function with the order ID as header.
>>>           -Ken Hutchison
>>
>> Got any code? The OP offered a reproducible example, after all.
>>
>> -- 
>> David.
>>>
>>> On Aug 3, 2554 BE, at 1:12 PM, David Winsemius
>>> <dwinsemius at comcast.net> wrote:
>>>
>>>>
>>>> On Aug 3, 2011, at 12:20 PM, jim holtman wrote:
>>>>
>>>>> This takes about 2 secs for 1M rows:
>>>>>
>>>>>> n <- 1000000
>>>>>> exampledata <- data.frame(orderID = sample(floor(n / 5), n,  
>>>>>> replace = TRUE), itemPrice = rpois(n, 10))
>>>>>> require(data.table)
>>>>>> # convert to data.table
>>>>>> ed.dt <- data.table(exampledata)
>>>>>> system.time(result <- ed.dt[
>>>>> +                         , list(total = sum(itemPrice))
>>>>> +                         , by = orderID
>>>>> +                         ]
>>>>> +            )
>>>>> user  system elapsed
>>>>> 1.30    0.05    1.34
>>>>
>>>> Interesting. Impressive. I also noted that the OP wanted what
>>>> cumsum would provide, and for some reason creating that longer
>>>> result is even faster on my machine than creating the shorter
>>>> result with sum.
>>>>
>>>> -- 
>>>> David.
>>>>>>
>>>>>> str(result)
>>>>> Classes ‘data.table’ and 'data.frame':  198708 obs. of  2  
>>>>> variables:
>>>>> $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
>>>>> $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
>>>>>> head(result)
>>>>> orderID total
>>>>> [1,]       1    49
>>>>> [2,]       2    37
>>>>> [3,]       3    72
>>>>> [4,]       4    92
>>>>> [5,]       5    50
>>>>> [6,]       6    76
>>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
>>>>> <caroline.faisst at gmail.com> wrote:
>>>>>> Hello there,
>>>>>>
>>>>>>
>>>>>> I’m computing the total value of an order from the price of the  
>>>>>> order items
>>>>>> using a “for” loop and the “ifelse” function. I do this on a  
>>>>>> large dataframe
>>>>>> (close to 1m lines). The computation of this function is  
>>>>>> painfully slow: in
>>>>>> 1min only about 90 rows are calculated.
>>>>>>
>>>>>>
>>>>>> The computation time taken for a given number of rows increases  
>>>>>> with the
>>>>>> size of the dataset, see the example with my function below:
>>>>>>
>>>>>>
>>>>>> # small dataset: function performs well
>>>>>>
>>>>>> exampledata <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4),
>>>>>>                           itemPrice=c(10,17,9,12,25,10,1,9,7))
>>>>>>
>>>>>> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>>>>>>
>>>>>> system.time(for (i in 2:length(exampledata[,1]))
>>>>>> {exampledata[i,"orderAmount"] <-
>>>>>>    ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],
>>>>>>           exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],
>>>>>>           exampledata[i,"itemPrice"])})
>>>>>>
>>>>>>
>>>>>> # large dataset: the very same computational task takes much longer
>>>>>>
>>>>>> exampledata2 <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),
>>>>>>                            itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>>>>>>
>>>>>> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>>>>>>
>>>>>> system.time(for (i in 2:9)
>>>>>> {exampledata2[i,"orderAmount"] <-
>>>>>>    ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],
>>>>>>           exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],
>>>>>>           exampledata2[i,"itemPrice"])})
>>>>>>
>>>>>>
>>>>>>
>>>>>> Does someone know a way to increase the speed?
>>>>>>
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>> Caroline
>>>>>>
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible  
>>>>>> code.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Jim Holtman
>>>>> Data Munger Guru
>>>>>
>>>>> What is the problem that you are trying to solve?
>>>>>
>>>>
>>>> David Winsemius, MD
>>>> West Hartford, CT
>>>>
>>
>>

David Winsemius, MD
West Hartford, CT


