[Rd] as.Date.character speed improvement suggestion
Simon Urbanek
simon.urbanek at r-project.org
Fri Aug 16 22:02:37 CEST 2013
On Aug 16, 2013, at 1:54 PM, McGehee, Robert wrote:
> R-Devel,
> I store and retrieve a large amount of financial data (millions of rows) in a PostgreSQL database keyed by date (and represented in R by class Date). Unfortunately, I frequently find that a great deal of processing time is spent converting dates from character representations to Date class representations in R, presumably because strptime is not fast for large vectors (>10,000 elements). I'd like to suggest a patch that speeds up the date conversion considerably for most large date vectors (up to 400x in some real-life cases).
>
This is more of a comment: if you want speed and have a standard date format, you can use fastPOSIXct from fasttime. The real bottleneck is the system calls that do the conversion, and fasttime avoids them by doing fast string parsing instead:
> system.time(dt1 <- as.Date.character(dtch))
user system elapsed
31.513 0.046 31.559
> system.time(dt1 <- as.Date(fasttime::fastPOSIXct(dtch)))
user system elapsed
0.055 0.018 0.074
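To make the "string parsing instead of system calls" idea concrete, here is a minimal pure-R sketch. It is not how fasttime is implemented (fasttime does its parsing in C); `parse_iso_date` is a hypothetical helper that assumes every non-NA input is a well-formed, fixed-width "YYYY-MM-DD" string and does no validation. It uses the usual March-based civil-calendar arithmetic to get days since 1970-01-01.

```r
## Hypothetical helper: parse fixed-width "YYYY-MM-DD" strings into Date
## using integer arithmetic only (no strptime). No input validation.
parse_iso_date <- function(x) {
  y <- as.integer(substr(x, 1L, 4L))
  m <- as.integer(substr(x, 6L, 7L))
  d <- as.integer(substr(x, 9L, 10L))
  ## Treat January and February as the end of the previous year, so the
  ## leap day falls at the end of the shifted year
  y   <- y - (m <= 2L)
  era <- y %/% 400L                       # 400-year Gregorian cycle
  yoe <- y - era * 400L                   # year within the cycle, 0..399
  mp  <- (m + 9L) %% 12L                  # March = 0, ..., February = 11
  doy <- (153L * mp + 2L) %/% 5L + d - 1L # day within the shifted year
  doe <- yoe * 365L + yoe %/% 4L - yoe %/% 100L + doy
  days <- era * 146097L + doe - 719468L   # days since 1970-01-01
  structure(as.numeric(days), class = "Date")
}

x <- c("1970-01-01", "2000-02-29", "2013-08-16", NA)
stopifnot(all(parse_iso_date(x) == as.Date(x), na.rm = TRUE))
```

Whether this beats as.Date on a given platform is something to measure with system.time rather than assume; the point is only that a fixed format needs no format-driven parser.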
Cutting back to unique dates may work for some applications (not for any of ours, because we always deal with timestamps, which is why we use POSIXct and not Date), but I'd argue that you may as well do it directly in your specialized application instead.
Cheers,
Simon
> I suspect most everyone with large vectors of class Date will find that most of their values are duplicated (repeatedly). (There are, after all, only 36,524 days in a century.) Given this, as.Date.character can be sped up substantially for large vectors by only calling strptime on unique dates and then filling in the calculated values for the entire vector. Since the time savings can be several minutes in real-life cases, I think this enhancement should certainly be considered. Also, in a worst case scenario of a long vector with only one duplicated value, the suggested change does not slow down the calculation.
>
> Here's a proof of concept:
> as.Date.character2 <- function(x, ...) {
>     if (anyDuplicated(x)) {
>         ux <- unique(x)
>         idx <- match(x, ux)
>         y <- as.Date.character(ux, ...)
>         return(y[idx])
>     }
>     as.Date.character(x, ...)
> }
>
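For anyone who wants to try the unique-then-match idea in their own code, a minimal variant follows. It is a sketch, not the proposed patch: `as_date_dedup` and `dates_chr` are hypothetical names, and it dispatches through the as.Date() generic instead of calling as.Date.character directly. The small input includes an NA to show that missing values survive the round trip through unique() and match().

```r
## Hypothetical name; same idea as the proof of concept, but dispatching
## through the as.Date() generic.
as_date_dedup <- function(x, ...) {
  ux <- unique(x)                  # each distinct string (and NA) once
  as.Date(ux, ...)[match(x, ux)]   # convert once, expand to full length
}

dates_chr <- c("2013-08-16", "2013-08-15", "2013-08-16", NA, "2013-08-15")
res <- as_date_dedup(dates_chr)
stopifnot(identical(res, as.Date(dates_chr)))
```

This version skips the anyDuplicated() check, so it always pays for unique() and match(); how much that matters for mostly-unique input is best checked by timing it.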
> ## Example1: Construct a 1-million length character vector of 1000 unique dates
> ## By considering only unique values, speed is >250x faster
>
>> dtch <- format(sample(Sys.Date()-1:1000, 1e6, replace=TRUE))
>> system.time(dt1 <- as.Date.character(dtch))
> user system elapsed
> 12.630 23.628 36.262
>> system.time(dt2 <- as.Date.character2(dtch))
> user system elapsed
> 0.117 0.019 0.136
>> identical(dt1, dt2)
> [1] TRUE
>
>
> ## Example2: In a "worst case" scenario of a character vector of length 1,000,002 with 1,000,001 unique dates
> ## the new function is not any slower (within error).
>> dtch <- format(c(Sys.Date(), Sys.Date()+-5e5:5e5))
>> system.time(dt1 <- as.Date.character(dtch))
> user system elapsed
> 20.264 25.584 45.855
>> system.time(dt2 <- as.Date.character2(dtch))
> user system elapsed
> 20.525 24.809 45.335
>> identical(dt1, dt2)
> [1] TRUE
>
> Alternatively, this logic could be built into strptime itself.
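As a rough illustration of what the same logic looks like one level down, here is a user-level wrapper around strptime. This is only a sketch under stated assumptions: `strptime_dedup` is a hypothetical name, it is not a proposed change to strptime's internals, and it relies on POSIXlt objects supporting `[` indexing.

```r
## Hypothetical wrapper: call strptime() once per distinct input string.
strptime_dedup <- function(x, format, tz = "") {
  ux <- unique(x)
  strptime(ux, format = format, tz = tz)[match(x, ux)]
}

stamps <- c("2013-08-16 22:02:37", "2013-08-16 13:54:00", "2013-08-16 22:02:37")
a <- strptime_dedup(stamps, "%Y-%m-%d %H:%M:%S", tz = "UTC")
b <- strptime(stamps, "%Y-%m-%d %H:%M:%S", tz = "UTC")
stopifnot(all(as.POSIXct(a) == as.POSIXct(b)))
```

The comparison goes through as.POSIXct because it checks the instants represented, not the internal layout of the POSIXlt lists. With full timestamps the share of duplicates is usually much lower than with dates, so the gain from a wrapper like this depends heavily on the data.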
>
> Robert
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel