[R-SIG-Finance] fast date/time format that handles both known and unknown timezones?

Dirk Eddelbuettel edd at debian.org
Fri Aug 13 23:00:19 CEST 2010


Hi Andrew,

Great question.  I'll just have a few small comments:

On 13 August 2010 at 16:26, Andrew Piskorski wrote:
| How do folks here prefer to represent date and time stamps on market
| tick data, and how have you dealt with time zones?  Any lessons
| learned?
| 
| The tick data I have to work with comes with a string date and time
| stamp, which typically is in the local exchange timezone but has no
| explicit timezone information anywhere in the data.
| 
| Now, often I don't care about the timezone anyway, the exchange local
| time (or even just elapsed time since market open) is all I'll need.
| However, I'll sometimes want to carefully line up timestamps across
| different exchanges in different timezones.  In those cases, I'd
| prefer to completely ignore timezones until THEN when I actually need
| them.
| 
| So ideally, I'd like to pick a single date and time format now which
| meets all of these (draft) criteria:
| 
| 1. Fast and compact, for use with tick data.
| 
| 2. Relatively simple low-level representation, to ease
|    interoperability with C and/or other non-R programming languages.
| 
| 3. Able to cleanly and unambiguously represent date-time stamps both
|    with an unspecified (currently unknown) timezone, and with any valid
|    timezone.
| 
| Thus, I'm very curious what others have used, and what criteria you
| found to actually be important in practice...
| 
| These are all the R date/time options I'm aware of:  POSIXct, POSIXlt,
| chron, and timeDate.  Did I miss any important ones?  as.Date is
| useful of course, but only for dates, not times.  Here's what I think
| of each of those options so far:
| 
| POSIXct's date and time representation is exactly that of Unix, a
| count of seconds since 1970-01-01 in GMT.  Nice and simple.  However,

It is better than Unix as it is fractional seconds with a microsecond
granularity where as standard Unix has (iirc) unsigned long integers.

   > options("digits.secs"=6)  # needed or sub-seconds not shown
   > Sys.time()
   [1] "2010-08-13 15:57:37.790672 CDT"
   > 

I use this type a lot for the same reasons you list: fast, light,
interchangeable.  But as you point out, no timezones.

| that inherently means that there isn't really any such thing as an
| unspecified timezone.  SOME timezone must immediately be assumed in
| order to get your data into the underlying seconds-since-GMT-epoch
| format.
| 
| timeDate appears to be a lot like POSIXct, but it has other features
| and I'm not sure whether it's underlying representation is really the
| same or not.  I've seen some contradictory information out there on
| what it does and what it's really good for; should I investigate
| further?
| 
| POSIXlt is effectively an R wrapper around the standard Unix libc tm
| structure used by strptime(3).  Thus, it is timezone agnostic, and can
| easily represent a date and time with the timezone unspecified or with
| any particular timezone.  However, it seems unwieldy and annoying to
| work with.

But there are converters to and from POSIXct !  So it is not "instead of" but
rather "along with" making POSIXct an even stronger candidate (provided you
handle TZ).

Also, difftime() works well on these types.

| chron uses Julian days, presumably in GMT.  By default days are
| counted since the Unix epoch of 1970-01-01, but if you pass in
| origin=0 you get standard Julian day numbers, where 1970-01-01 =
| 2440588.  Times are fractional days.  However, chron claims to have no
| timezone support at all.
| 
| as.Date (class "Date"), like chron, uses Julian days, but its origin
| is always 1970-01-01, you cannot change it.  It deals solely with
| dates, not times.

Yes you can change the origin. See help(as.Date).

Cheers, Dirk

| Of the above, POSIXct was my obvious initial candidate, it's only
| timezone handling that made me think twice.  If I do use POSIXct,
| AFAICT my options are either:
| 
| 1. Put in the work upfront to make sure I have all my timezones
| correct for all my data.  I am leery of actually getting that truly
| correct, particularly with daylight savings time differences, possible
| arbitrary changes in data vendor practices over time, etc.
| 
| 2. Or, pretend that all my timestamps are actually in GMT, save them
| that way on disk, and convert later if/when I need to use the true
| exchange-local timezone.
| 
| 3. Something in between the two above approaches, pick a likely
| timezone and save the data in that, regardless of how certain I am
| that it's really 100% correct.  However, that then suggests...
| 
| 4. Build some sort of wrapper around POSIXct which lets me be clearer
| about whether I know the correct timezone or not.
| 
| And of course there are plenty of ways to build something from scratch
| without directly involving POSIXct at all.  (Perhaps even with some
| sort of clever flag bits to optionally note timezones embedded within
| an otherwise vanilla fractional Julian day or Unix epoch based value.)
| 
| Finally, below is some R output to demonstrate the apparent date/time
| representations used by various R functions and classes:
| 
| ------------------------------------------------------------
| 
| # R's default Julian date origin is 1970-01-01, but if you pass in
| # origin=0, you get standard Julian Day Numbers where 1970-01-01 is
| # JDN 2440588.  That also happens to agree exactly with Oracle 10g.
| #
| # Oddly, with origin=0 R seems to think that JDN 0 is 4713-11-24 BC
| # rather than either the 4712-01-01 BC from Oracle or the 4713-01-01
| # BC given on Wikipedia.  However this does not seem to make any
| # practical difference.
| #   http://en.wikipedia.org/wiki/Julian_day
| 
| # chron and its helper functions julian, month.day.year, etc.:
| > unlist(month.day.year(0))
| month   day  year 
|     1     1  1970 
| > unlist(month.day.year(0 ,origin=0))
| month   day  year 
|    11    24 -4713 
| > chron:::julian.default(1,1,1970)
| [1] 0
| > chron:::julian.default(1,1,1970 ,origin=0)
| [1] 2440588
| > print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s"))) ,digits=17)
| [1] 0.0000000000000000000 0.0006944444444444445
| > print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s") ,origin=0)) ,digits=17)
| [1] 2440588.000000000 2440588.000694444
| > 0.0006944444444444445 *24*60
| [1] 1
| 
| 
| oracle> select to_char(to_date('4712-01-01 BC','YYYY-MM-DD BC'),'J') as julian from dual;
| JULIAN
| -------
| 0000001
| oracle> select to_char(to_date('1970-01-01 AD','YYYY-MM-DD BC'),'J') as julian from dual;
| JULIAN
| -------
| 2440588
| 
| oracle> select to_char(to_date('1','J') ,'BC YYYY-MM-DD HH24:MI:SS') as origin from dual;
| ORIGIN
| ----------------------
| BC 4712-01-01 00:00:00
| oracle> select to_char(to_date('2440588','J') ,'BC YYYY-MM-DD HH24:MI:SS') as epoch from dual;
| EPOCH
| ----------------------
| AD 1970-01-01 00:00:00
| 
| oracle> select to_char(to_date('4713-11-24 BC','YYYY-MM-DD BC'),'J') as julian from dual;
| ERROR at line 1:
| ORA-01841: (full) year must be between -4713 and +9999, and not be 0
| 
| 
| # as.Date:  as.Date.character uses strptime, and origin=0 has no effect.
| > as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02")))
| [1] -1  0  1
| > as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02") ,origin=0))
| [1] -1  0  1
| 
| 
| # as.POSIXct:  Default is local timezone, which happens to be EST -0500:
| > as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))
| [1]  0 60
| > as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))
| [1] 18000 18060
| > 18000 /60^2
| [1] 5
| 
| 
| posixlt.to.tz.df <- function(xx) {
|    rr <- as.data.frame(unclass(xx))
|    rr[,"tzone"] <- NA
|    if (!is.na(match("tzone" ,names(attributes(xx)))))
|       rr[,"tzone"] <- attributes(xx)$tzone
|    rr
| }
| 
| # as.POSIXlt:  Is technically capable of expressing "timezone unknown":
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))
|   sec min hour mday mon year wday yday isdst tzone
| 1   0   0    0    1   0   70    4    0     0   GMT
| 2   0   1    0    1   0   70    4    0     0   GMT
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="EST"))
|   sec min hour mday mon year wday yday isdst tzone
| 1   0   0    0    1   0   70    4    0     0   EST
| 2   0   1    0    1   0   70    4    0     0   EST
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))
|   sec min hour mday mon year wday yday isdst tzone
| 1   0   0    0    1   0   70    4    0     0    NA
| 2   0   1    0    1   0   70    4    0     0    NA
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST"))
|   sec min hour mday mon year wday yday isdst tzone
| 1   0   0    0    1   0   70    4    0     0   PST
| 2   0   1    0    1   0   70    4    0     0   PST
| > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800"))
|   sec min hour mday mon year wday yday isdst tzone
| 1   0   0    0    1   0   70    4    0     0 -0800
| 2   0   1    0    1   0   70    4    0     0 -0800
| 
| # as.POSIXct is in fact implemented by using strptime and as.POSIXlt,
| # but here show explicit conversion of lt to ct anyway:
| > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT")))
| [1]  0 60
| > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00"))))
| [1] 18000 18060
| > 18000 /60^2
| [1] 5
| 
| # BUG: Oddly, as.POSIXct() does NOT seem to respect non-GMT non-local
| # timzones like PST:
| > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST")))
| [1]  0 60
| > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800")))
| [1]  0 60
| 
| 
| # Note the (documented) subtlety due to GMT/UTC when converting
| # between class "Date" and "POSIXct":
| > as.POSIXct("2010-03-13 19:00:00 EST")
| [1] "2010-03-13 19:00:00 EST"
| > as.Date(as.POSIXct("2010-03-13 19:00:00 EST"))
| [1] "2010-03-14"
| > as.Date("2010-03-14"))
| [1] "2010-03-14 UTC"
| > as.POSIXct(as.Date("2010-03-14"))
| [1] "2010-03-13 19:00:00 EST"
| 
| 
| # timeDate appears to be much like POSIXct, counting seconds since the
| # Unix epoch, but perhaps its underlying implementation is actually
| # different, given its explicit use of a "units" attribute below:
| 
| > timeDate(0)
| GMT
| [1] [1970-01-01]
| > timeDate("1970-01-01")
| GMT
| [1] [1970-01-01]
| 
| > as.numeric(timeDate("1970-01-01"))
| [1] 0
| attr(,"FinCenter")
| [1] "GMT"
| attr(,"units")
| [1] "secs"
| attr(,"origin")
| [1] "1970-01-01 00:00:00 GMT"
| 
| > timeDate(2)
| GMT
| [1] [1970-01-01 00:00:02]
| 
| > as.numeric(timeDate(2))
| [1] 2
| attr(,"FinCenter")
| [1] "GMT"
| attr(,"units")
| [1] "secs"
| attr(,"origin")
| [1] "1970-01-01 00:00:00 GMT"
| 
| > as.numeric(timeDate(24*60^2))
| [1] 1
| attr(,"FinCenter")
| [1] "GMT"
| attr(,"units")
| [1] "days"
| attr(,"origin")
| [1] "1970-01-01 GMT"
| 
| ------------------------------------------------------------
| 
| -- 
| Andrew Piskorski <atp at piskorski.com>
| http://www.piskorski.com/
| 
| _______________________________________________
| R-SIG-Finance at stat.math.ethz.ch mailing list
| https://stat.ethz.ch/mailman/listinfo/r-sig-finance
| -- Subscriber-posting only. If you want to post, subscribe first.
| -- Also note that this is not the r-help list where general R questions should go.

-- 
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com



More information about the R-SIG-Finance mailing list