[R-SIG-Finance] fast date/time format that handles both known and unknown timezones?

Andrew Piskorski atp at piskorski.com
Fri Aug 13 22:26:35 CEST 2010


How do folks here prefer to represent date and time stamps on market
tick data, and how have you dealt with time zones?  Any lessons
learned?

The tick data I have to work with comes with a string date and time
stamp, which typically is in the local exchange timezone but has no
explicit timezone information anywhere in the data.

Now, often I don't care about the timezone anyway, the exchange local
time (or even just elapsed time since market open) is all I'll need.
However, I'll sometimes want to carefully line up timestamps across
different exchanges in different timezones.  In those cases, I'd
prefer to completely ignore timezones until THEN when I actually need
them.

So ideally, I'd like to pick a single date and time format now which
meets all of these (draft) criteria:

1. Fast and compact, for use with tick data.

2. Relatively simple low-level representation, to ease
   interoperability with C and/or other non-R programming languages.

3. Able to cleanly and unambiguously represent date-time stamps both
   with an unspecified (currently unknown) timezone, and with any valid
   timezone.

Thus, I'm very curious what others have used, and what criteria you
found to actually be important in practice...

These are all the R date/time options I'm aware of:  POSIXct, POSIXlt,
chron, and timeDate.  Did I miss any important ones?  as.Date is
useful of course, but only for dates, not times.  Here's what I think
of each of those options so far:

POSIXct's date and time representation is exactly that of Unix, a
count of seconds since 1970-01-01 in GMT.  Nice and simple.  However,
that inherently means that there isn't really any such thing as an
unspecified timezone.  SOME timezone must immediately be assumed in
order to get your data into the underlying seconds-since-GMT-epoch
format.

timeDate appears to be a lot like POSIXct, but it has other features
and I'm not sure whether it's underlying representation is really the
same or not.  I've seen some contradictory information out there on
what it does and what it's really good for; should I investigate
further?

POSIXlt is effectively an R wrapper around the standard Unix libc tm
structure used by strptime(3).  Thus, it is timezone agnostic, and can
easily represent a date and time with the timezone unspecified or with
any particular timezone.  However, it seems unwieldy and annoying to
work with.

chron uses Julian days, presumably in GMT.  By default days are
counted since the Unix epoch of 1970-01-01, but if you pass in
origin=0 you get standard Julian day numbers, where 1970-01-01 =
2440588.  Times are fractional days.  However, chron claims to have no
timezone support at all.

as.Date (class "Date"), like chron, uses Julian days, but its origin
is always 1970-01-01, you cannot change it.  It deals solely with
dates, not times.

Of the above, POSIXct was my obvious initial candidate, it's only
timezone handling that made me think twice.  If I do use POSIXct,
AFAICT my options are either:

1. Put in the work upfront to make sure I have all my timezones
correct for all my data.  I am leery of actually getting that truly
correct, particularly with daylight savings time differences, possible
arbitrary changes in data vendor practices over time, etc.

2. Or, pretend that all my timestamps are actually in GMT, save them
that way on disk, and convert later if/when I need to use the true
exchange-local timezone.

3. Something in between the two above approaches, pick a likely
timezone and save the data in that, regardless of how certain I am
that it's really 100% correct.  However, that then suggests...

4. Build some sort of wrapper around POSIXct which lets me be clearer
about whether I know the correct timezone or not.

And of course there are plenty of ways to build something from scratch
without directly involving POSIXct at all.  (Perhaps even with some
sort of clever flag bits to optionally note timezones embedded within
an otherwise vanilla fractional Julian day or Unix epoch based value.)

Finally, below is some R output to demonstrate the apparent date/time
representations used by various R functions and classes:

------------------------------------------------------------

# R's default Julian date origin is 1970-01-01, but if you pass in
# origin=0, you get standard Julian Day Numbers where 1970-01-01 is
# JDN 2440588.  That also happens to agree exactly with Oracle 10g.
#
# Oddly, with origin=0 R seems to think that JDN 0 is 4713-11-24 BC
# rather than either the 4712-01-01 BC from Oracle or the 4713-01-01
# BC given on Wikipedia.  However this does not seem to make any
# practical difference.
#   http://en.wikipedia.org/wiki/Julian_day

# chron and its helper functions julian, month.day.year, etc.:
> unlist(month.day.year(0))
month   day  year 
    1     1  1970 
> unlist(month.day.year(0 ,origin=0))
month   day  year 
   11    24 -4713 
> chron:::julian.default(1,1,1970)
[1] 0
> chron:::julian.default(1,1,1970 ,origin=0)
[1] 2440588
> print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s"))) ,digits=17)
[1] 0.0000000000000000000 0.0006944444444444445
> print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s") ,origin=0)) ,digits=17)
[1] 2440588.000000000 2440588.000694444
> 0.0006944444444444445 *24*60
[1] 1


oracle> select to_char(to_date('4712-01-01 BC','YYYY-MM-DD BC'),'J') as julian from dual;
JULIAN
-------
0000001
oracle> select to_char(to_date('1970-01-01 AD','YYYY-MM-DD BC'),'J') as julian from dual;
JULIAN
-------
2440588

oracle> select to_char(to_date('1','J') ,'BC YYYY-MM-DD HH24:MI:SS') as origin from dual;
ORIGIN
----------------------
BC 4712-01-01 00:00:00
oracle> select to_char(to_date('2440588','J') ,'BC YYYY-MM-DD HH24:MI:SS') as epoch from dual;
EPOCH
----------------------
AD 1970-01-01 00:00:00

oracle> select to_char(to_date('4713-11-24 BC','YYYY-MM-DD BC'),'J') as julian from dual;
ERROR at line 1:
ORA-01841: (full) year must be between -4713 and +9999, and not be 0


# as.Date:  as.Date.character uses strptime, and origin=0 has no effect.
> as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02")))
[1] -1  0  1
> as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02") ,origin=0))
[1] -1  0  1


# as.POSIXct:  Default is local timezone, which happens to be EST -0500:
> as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))
[1]  0 60
> as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))
[1] 18000 18060
> 18000 /60^2
[1] 5


posixlt.to.tz.df <- function(xx) {
   rr <- as.data.frame(unclass(xx))
   rr[,"tzone"] <- NA
   if (!is.na(match("tzone" ,names(attributes(xx)))))
      rr[,"tzone"] <- attributes(xx)$tzone
   rr
}

# as.POSIXlt:  Is technically capable of expressing "timezone unknown":
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))
  sec min hour mday mon year wday yday isdst tzone
1   0   0    0    1   0   70    4    0     0   GMT
2   0   1    0    1   0   70    4    0     0   GMT
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="EST"))
  sec min hour mday mon year wday yday isdst tzone
1   0   0    0    1   0   70    4    0     0   EST
2   0   1    0    1   0   70    4    0     0   EST
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))
  sec min hour mday mon year wday yday isdst tzone
1   0   0    0    1   0   70    4    0     0    NA
2   0   1    0    1   0   70    4    0     0    NA
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST"))
  sec min hour mday mon year wday yday isdst tzone
1   0   0    0    1   0   70    4    0     0   PST
2   0   1    0    1   0   70    4    0     0   PST
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800"))
  sec min hour mday mon year wday yday isdst tzone
1   0   0    0    1   0   70    4    0     0 -0800
2   0   1    0    1   0   70    4    0     0 -0800

# as.POSIXct is in fact implemented by using strptime and as.POSIXlt,
# but here show explicit conversion of lt to ct anyway:
> as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT")))
[1]  0 60
> as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00"))))
[1] 18000 18060
> 18000 /60^2
[1] 5

# BUG: Oddly, as.POSIXct() does NOT seem to respect non-GMT non-local
# timzones like PST:
> as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST")))
[1]  0 60
> as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800")))
[1]  0 60


# Note the (documented) subtlety due to GMT/UTC when converting
# between class "Date" and "POSIXct":
> as.POSIXct("2010-03-13 19:00:00 EST")
[1] "2010-03-13 19:00:00 EST"
> as.Date(as.POSIXct("2010-03-13 19:00:00 EST"))
[1] "2010-03-14"
> as.Date("2010-03-14"))
[1] "2010-03-14 UTC"
> as.POSIXct(as.Date("2010-03-14"))
[1] "2010-03-13 19:00:00 EST"


# timeDate appears to be much like POSIXct, counting seconds since the
# Unix epoch, but perhaps its underlying implementation is actually
# different, given its explicit use of a "units" attribute below:

> timeDate(0)
GMT
[1] [1970-01-01]
> timeDate("1970-01-01")
GMT
[1] [1970-01-01]

> as.numeric(timeDate("1970-01-01"))
[1] 0
attr(,"FinCenter")
[1] "GMT"
attr(,"units")
[1] "secs"
attr(,"origin")
[1] "1970-01-01 00:00:00 GMT"

> timeDate(2)
GMT
[1] [1970-01-01 00:00:02]

> as.numeric(timeDate(2))
[1] 2
attr(,"FinCenter")
[1] "GMT"
attr(,"units")
[1] "secs"
attr(,"origin")
[1] "1970-01-01 00:00:00 GMT"

> as.numeric(timeDate(24*60^2))
[1] 1
attr(,"FinCenter")
[1] "GMT"
attr(,"units")
[1] "days"
attr(,"origin")
[1] "1970-01-01 GMT"

------------------------------------------------------------

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/



More information about the R-SIG-Finance mailing list