[R] Performance of 'by' and 'ddply' on a large data frame
Tahir Butt
tahir.butt at gmail.com
Fri Nov 20 21:04:07 CET 2009
A faster solution using tapply was sent to me via email:
testtapply = function(p){
df = randomdf(p)
system.time({res = tapply(df$x2,df$x1,min);
res = as.Date(res,origin=as.Date('1970-01-01'));
df$mindate = res[as.character(df$x1)]})
}
Thanks Phil!
Tahir
On Thu, Nov 19, 2009 at 5:19 PM, Tahir Butt <tahir.butt at gmail.com> wrote:
> I've only recently started using R. One of the problems I come up
> against is after having extracted a large dataset (>5M rows) out of
> database, I realize I need another variable. In this case I have data
> frame with dates. I want to find the minimum date for each value of x1
> and add that minimum date to my data.frame.
>
>> randomdf <- function(p) {
> data.frame(x1=sample(1:10^4, 10^p, replace=T),
> x2=sample(seq.Date(Sys.Date() - 356*3,Sys.Date(), by="day"), 10^p, replace=T),
> y1=sample(1:100, 10^p, replace=T))
> }
>> testby <- function(p) {
> df <- randomdf(p)
> system.time(by(df, df$x1, function(dfi) { min(dfi$x2) }))
> }
>> lapply(c(1,2,3,4,5), testby)
> [[1]]
> user system elapsed
> 0.006 0.000 0.006
>
> [[2]]
> user system elapsed
> 0.024 0.000 0.025
>
> [[3]]
> user system elapsed
> 0.233 0.000 0.234
>
> [[4]]
> user system elapsed
> 1.996 0.026 2.022
>
> [[5]]
> user system elapsed
> 11.030 0.000 11.032
>
> Strangely enough, not sure why this is, the result of by with the min
> function is not date objects but instead integers representing days
> from an origin. Is there a min function that would return me a date
> instead of an integer? Or is this a result of using by?
>
> I also wanted to see how ddply compares.
>
>> testddply <- function(p) { pdf <- randomdf(p); system.time(ddply(pdf, .(x1), function(df) { return (data.frame(min(df$x2))) })) }
>> lapply(c(1,2,3,4,5), testddply)
> [[1]]
> user system elapsed
> 0.020 0.000 0.021
>
> [[2]]
> user system elapsed
> 0.119 0.000 0.119
>
> [[3]]
> user system elapsed
> 1.008 0.000 1.008
>
> [[4]]
> user system elapsed
> 8.425 0.001 8.428
>
> [[5]]
> user system elapsed
> 23.070 0.000 23.075
>
> Once the data frame gets above 1M rows, the timings are a bit too long
> (on a previous run it went up to 8000s user time). This seems quite a
> bit slower than I expected. Maybe there's a better and faster way to
> add such variables to a data frame that are derived using some
> aggregation.
>
> Also, ddply seems to take twice as long as by. Are these two
> operations not equivalent?
>
> Thanks,
> Tahir
>
More information about the R-help
mailing list