[R-SIG-Finance] fast date/time format that handles both known and unknown timezones?
Dirk Eddelbuettel
edd at debian.org
Fri Aug 13 23:00:19 CEST 2010
Hi Andrew,
Great question. I'll just have a few small comments:
On 13 August 2010 at 16:26, Andrew Piskorski wrote:
| How do folks here prefer to represent date and time stamps on market
| tick data, and how have you dealt with time zones? Any lessons
| learned?
|
| The tick data I have to work with comes with a string date and time
| stamp, which typically is in the local exchange timezone but has no
| explicit timezone information anywhere in the data.
|
| Now, often I don't care about the timezone anyway, the exchange local
| time (or even just elapsed time since market open) is all I'll need.
| However, I'll sometimes want to carefully line up timestamps across
| different exchanges in different timezones. In those cases, I'd
| prefer to completely ignore timezones until THEN when I actually need
| them.
|
| So ideally, I'd like to pick a single date and time format now which
| meets all of these (draft) criteria:
|
| 1. Fast and compact, for use with tick data.
|
| 2. Relatively simple low-level representation, to ease
| interoperability with C and/or other non-R programming languages.
|
| 3. Able to cleanly and unambiguously represent date-time stamps both
| with an unspecified (currently unknown) timezone, and with any valid
| timezone.
|
| Thus, I'm very curious what others have used, and what criteria you
| found to actually be important in practice...
|
| These are all the R date/time options I'm aware of: POSIXct, POSIXlt,
| chron, and timeDate. Did I miss any important ones? as.Date is
| useful of course, but only for dates, not times. Here's what I think
| of each of those options so far:
|
| POSIXct's date and time representation is exactly that of Unix, a
| count of seconds since 1970-01-01 in GMT. Nice and simple. However,
It is better than Unix as it is fractional seconds with a microsecond
granularity where as standard Unix has (iirc) unsigned long integers.
> options("digits.secs"=6) # needed or sub-seconds not shown
> Sys.time()
[1] "2010-08-13 15:57:37.790672 CDT"
>
I use this type a lot for the same reasons you list: fast, light,
interchangeable. But as you point out, no timezones.
| that inherently means that there isn't really any such thing as an
| unspecified timezone. SOME timezone must immediately be assumed in
| order to get your data into the underlying seconds-since-GMT-epoch
| format.
|
| timeDate appears to be a lot like POSIXct, but it has other features
| and I'm not sure whether it's underlying representation is really the
| same or not. I've seen some contradictory information out there on
| what it does and what it's really good for; should I investigate
| further?
|
| POSIXlt is effectively an R wrapper around the standard Unix libc tm
| structure used by strptime(3). Thus, it is timezone agnostic, and can
| easily represent a date and time with the timezone unspecified or with
| any particular timezone. However, it seems unwieldy and annoying to
| work with.
But there are converters to and from POSIXct ! So it is not "instead of" but
rather "along with" making POSIXct an even stronger candidate (provided you
handle TZ).
Also, difftime() works well on these types.
| chron uses Julian days, presumably in GMT. By default days are
| counted since the Unix epoch of 1970-01-01, but if you pass in
| origin=0 you get standard Julian day numbers, where 1970-01-01 =
| 2440588. Times are fractional days. However, chron claims to have no
| timezone support at all.
|
| as.Date (class "Date"), like chron, uses Julian days, but its origin
| is always 1970-01-01, you cannot change it. It deals solely with
| dates, not times.
Yes you can change the origin. See help(as.Date).
Cheers, Dirk
| Of the above, POSIXct was my obvious initial candidate, it's only
| timezone handling that made me think twice. If I do use POSIXct,
| AFAICT my options are either:
|
| 1. Put in the work upfront to make sure I have all my timezones
| correct for all my data. I am leery of actually getting that truly
| correct, particularly with daylight savings time differences, possible
| arbitrary changes in data vendor practices over time, etc.
|
| 2. Or, pretend that all my timestamps are actually in GMT, save them
| that way on disk, and convert later if/when I need to use the true
| exchange-local timezone.
|
| 3. Something in between the two above approaches, pick a likely
| timezone and save the data in that, regardless of how certain I am
| that it's really 100% correct. However, that then suggests...
|
| 4. Build some sort of wrapper around POSIXct which lets me be clearer
| about whether I know the correct timezone or not.
|
| And of course there are plenty of ways to build something from scratch
| without directly involving POSIXct at all. (Perhaps even with some
| sort of clever flag bits to optionally note timezones embedded within
| an otherwise vanilla fractional Julian day or Unix epoch based value.)
|
| Finally, below is some R output to demonstrate the apparent date/time
| representations used by various R functions and classes:
|
| ------------------------------------------------------------
|
| # R's default Julian date origin is 1970-01-01, but if you pass in
| # origin=0, you get standard Julian Day Numbers where 1970-01-01 is
| # JDN 2440588. That also happens to agree exactly with Oracle 10g.
| #
| # Oddly, with origin=0 R seems to think that JDN 0 is 4713-11-24 BC
| # rather than either the 4712-01-01 BC from Oracle or the 4713-01-01
| # BC given on Wikipedia. However this does not seem to make any
| # practical difference.
| # http://en.wikipedia.org/wiki/Julian_day
|
| # chron and its helper functions julian, month.day.year, etc.:
| > unlist(month.day.year(0))
| month day year
| 1 1 1970
| > unlist(month.day.year(0 ,origin=0))
| month day year
| 11 24 -4713
| > chron:::julian.default(1,1,1970)
| [1] 0
| > chron:::julian.default(1,1,1970 ,origin=0)
| [1] 2440588
| > print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s"))) ,digits=17)
| [1] 0.0000000000000000000 0.0006944444444444445
| > print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s") ,origin=0)) ,digits=17)
| [1] 2440588.000000000 2440588.000694444
| > 0.0006944444444444445 *24*60
| [1] 1
|
|
| oracle> select to_char(to_date('4712-01-01 BC','YYYY-MM-DD BC'),'J') as julian from dual;
| JULIAN
| -------
| 0000001
| oracle> select to_char(to_date('1970-01-01 AD','YYYY-MM-DD BC'),'J') as julian from dual;
| JULIAN
| -------
| 2440588
|
| oracle> select to_char(to_date('1','J') ,'BC YYYY-MM-DD HH24:MI:SS') as origin from dual;
| ORIGIN
| ----------------------
| BC 4712-01-01 00:00:00
| oracle> select to_char(to_date('2440588','J') ,'BC YYYY-MM-DD HH24:MI:SS') as epoch from dual;
| EPOCH
| ----------------------
| AD 1970-01-01 00:00:00
|
| oracle> select to_char(to_date('4713-11-24 BC','YYYY-MM-DD BC'),'J') as julian from dual;
| ERROR at line 1:
| ORA-01841: (full) year must be between -4713 and +9999, and not be 0
|
|
| # as.Date: as.Date.character uses strptime, and origin=0 has no effect.
| > as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02")))
| [1] -1 0 1
| > as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02") ,origin=0))
| [1] -1 0 1
|
|
| # as.POSIXct: Default is local timezone, which happens to be EST -0500:
| > as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))
| [1] 0 60
| > as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))
| [1] 18000 18060
| > 18000 /60^2
| [1] 5
|
|
| posixlt.to.tz.df <- function(xx) {
| rr <- as.data.frame(unclass(xx))
| rr[,"tzone"] <- NA
| if (!is.na(match("tzone" ,names(attributes(xx)))))
| rr[,"tzone"] <- attributes(xx)$tzone
| rr
| }
|
| # as.POSIXlt: Is technically capable of expressing "timezone unknown":
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))
| sec min hour mday mon year wday yday isdst tzone
| 1 0 0 0 1 0 70 4 0 0 GMT
| 2 0 1 0 1 0 70 4 0 0 GMT
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="EST"))
| sec min hour mday mon year wday yday isdst tzone
| 1 0 0 0 1 0 70 4 0 0 EST
| 2 0 1 0 1 0 70 4 0 0 EST
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))
| sec min hour mday mon year wday yday isdst tzone
| 1 0 0 0 1 0 70 4 0 0 NA
| 2 0 1 0 1 0 70 4 0 0 NA
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST"))
| sec min hour mday mon year wday yday isdst tzone
| 1 0 0 0 1 0 70 4 0 0 PST
| 2 0 1 0 1 0 70 4 0 0 PST
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800"))
| sec min hour mday mon year wday yday isdst tzone
| 1 0 0 0 1 0 70 4 0 0 -0800
| 2 0 1 0 1 0 70 4 0 0 -0800
|
| # as.POSIXct is in fact implemented by using strptime and as.POSIXlt,
| # but here show explicit conversion of lt to ct anyway:
| > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT")))
| [1] 0 60
| > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00"))))
| [1] 18000 18060
| > 18000 /60^2
| [1] 5
|
| # BUG: Oddly, as.POSIXct() does NOT seem to respect non-GMT non-local
| # timzones like PST:
| > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST")))
| [1] 0 60
| > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800")))
| [1] 0 60
|
|
| # Note the (documented) subtlety due to GMT/UTC when converting
| # between class "Date" and "POSIXct":
| > as.POSIXct("2010-03-13 19:00:00 EST")
| [1] "2010-03-13 19:00:00 EST"
| > as.Date(as.POSIXct("2010-03-13 19:00:00 EST"))
| [1] "2010-03-14"
| > as.Date("2010-03-14"))
| [1] "2010-03-14 UTC"
| > as.POSIXct(as.Date("2010-03-14"))
| [1] "2010-03-13 19:00:00 EST"
|
|
| # timeDate appears to be much like POSIXct, counting seconds since the
| # Unix epoch, but perhaps its underlying implementation is actually
| # different, given its explicit use of a "units" attribute below:
|
| > timeDate(0)
| GMT
| [1] [1970-01-01]
| > timeDate("1970-01-01")
| GMT
| [1] [1970-01-01]
|
| > as.numeric(timeDate("1970-01-01"))
| [1] 0
| attr(,"FinCenter")
| [1] "GMT"
| attr(,"units")
| [1] "secs"
| attr(,"origin")
| [1] "1970-01-01 00:00:00 GMT"
|
| > timeDate(2)
| GMT
| [1] [1970-01-01 00:00:02]
|
| > as.numeric(timeDate(2))
| [1] 2
| attr(,"FinCenter")
| [1] "GMT"
| attr(,"units")
| [1] "secs"
| attr(,"origin")
| [1] "1970-01-01 00:00:00 GMT"
|
| > as.numeric(timeDate(24*60^2))
| [1] 1
| attr(,"FinCenter")
| [1] "GMT"
| attr(,"units")
| [1] "days"
| attr(,"origin")
| [1] "1970-01-01 GMT"
|
| ------------------------------------------------------------
|
| --
| Andrew Piskorski <atp at piskorski.com>
| http://www.piskorski.com/
|
| _______________________________________________
| R-SIG-Finance at stat.math.ethz.ch mailing list
| https://stat.ethz.ch/mailman/listinfo/r-sig-finance
| -- Subscriber-posting only. If you want to post, subscribe first.
| -- Also note that this is not the r-help list where general R questions should go.
--
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
More information about the R-SIG-Finance
mailing list