[R-SIG-Finance] [SPAM] - fast date/time format that handles both known andunknown timezones? - Email found in subject
David Reiner
David.Reiner at xrtrading.com
Fri Aug 13 23:37:22 CEST 2010
Those are lots of good questions.
As far as the time zone, most data vendors I am aware of store dates in their databases as UTC (slightly different from GMT.)
Converting back and forth from local time can be a pain, (especially if you forget!)
But the certainty of knowing what is in your DB overcomes that, I think,
merging data from different time zones is a snap, and you save storage.
Just think about combining German and US data over an interval in the spring or fall,
when summer time/day light savings comes and goes on different dates.
My two cents' worth.
-- David
-----Original Message-----
From: r-sig-finance-bounces at stat.math.ethz.ch [mailto:r-sig-finance-bounces at stat.math.ethz.ch] On Behalf Of Andrew Piskorski
Sent: Friday, August 13, 2010 3:27 PM
To: r-sig-finance at stat.math.ethz.ch
Subject: [SPAM] - [R-SIG-Finance] fast date/time format that handles both known andunknown timezones? - Email found in subject
How do folks here prefer to represent date and time stamps on market
tick data, and how have you dealt with time zones? Any lessons
learned?
The tick data I have to work with comes with a string date and time
stamp, which typically is in the local exchange timezone but has no
explicit timezone information anywhere in the data.
Now, often I don't care about the timezone anyway, the exchange local
time (or even just elapsed time since market open) is all I'll need.
However, I'll sometimes want to carefully line up timestamps across
different exchanges in different timezones. In those cases, I'd
prefer to completely ignore timezones until THEN when I actually need
them.
So ideally, I'd like to pick a single date and time format now which
meets all of these (draft) criteria:
1. Fast and compact, for use with tick data.
2. Relatively simple low-level representation, to ease
interoperability with C and/or other non-R programming languages.
3. Able to cleanly and unambiguously represent date-time stamps both
with an unspecified (currently unknown) timezone, and with any valid
timezone.
Thus, I'm very curious what others have used, and what criteria you
found to actually be important in practice...
These are all the R date/time options I'm aware of: POSIXct, POSIXlt,
chron, and timeDate. Did I miss any important ones? as.Date is
useful of course, but only for dates, not times. Here's what I think
of each of those options so far:
POSIXct's date and time representation is exactly that of Unix, a
count of seconds since 1970-01-01 in GMT. Nice and simple. However,
that inherently means that there isn't really any such thing as an
unspecified timezone. SOME timezone must immediately be assumed in
order to get your data into the underlying seconds-since-GMT-epoch
format.
timeDate appears to be a lot like POSIXct, but it has other features
and I'm not sure whether it's underlying representation is really the
same or not. I've seen some contradictory information out there on
what it does and what it's really good for; should I investigate
further?
POSIXlt is effectively an R wrapper around the standard Unix libc tm
structure used by strptime(3). Thus, it is timezone agnostic, and can
easily represent a date and time with the timezone unspecified or with
any particular timezone. However, it seems unwieldy and annoying to
work with.
chron uses Julian days, presumably in GMT. By default days are
counted since the Unix epoch of 1970-01-01, but if you pass in
origin=0 you get standard Julian day numbers, where 1970-01-01 =
2440588. Times are fractional days. However, chron claims to have no
timezone support at all.
as.Date (class "Date"), like chron, uses Julian days, but its origin
is always 1970-01-01, you cannot change it. It deals solely with
dates, not times.
Of the above, POSIXct was my obvious initial candidate, it's only
timezone handling that made me think twice. If I do use POSIXct,
AFAICT my options are either:
1. Put in the work upfront to make sure I have all my timezones
correct for all my data. I am leery of actually getting that truly
correct, particularly with daylight savings time differences, possible
arbitrary changes in data vendor practices over time, etc.
2. Or, pretend that all my timestamps are actually in GMT, save them
that way on disk, and convert later if/when I need to use the true
exchange-local timezone.
3. Something in between the two above approaches, pick a likely
timezone and save the data in that, regardless of how certain I am
that it's really 100% correct. However, that then suggests...
4. Build some sort of wrapper around POSIXct which lets me be clearer
about whether I know the correct timezone or not.
And of course there are plenty of ways to build something from scratch
without directly involving POSIXct at all. (Perhaps even with some
sort of clever flag bits to optionally note timezones embedded within
an otherwise vanilla fractional Julian day or Unix epoch based value.)
Finally, below is some R output to demonstrate the apparent date/time
representations used by various R functions and classes:
------------------------------------------------------------
# R's default Julian date origin is 1970-01-01, but if you pass in
# origin=0, you get standard Julian Day Numbers where 1970-01-01 is
# JDN 2440588. That also happens to agree exactly with Oracle 10g.
#
# Oddly, with origin=0 R seems to think that JDN 0 is 4713-11-24 BC
# rather than either the 4712-01-01 BC from Oracle or the 4713-01-01
# BC given on Wikipedia. However this does not seem to make any
# practical difference.
# http://en.wikipedia.org/wiki/Julian_day
# chron and its helper functions julian, month.day.year, etc.:
> unlist(month.day.year(0))
month day year
1 1 1970
> unlist(month.day.year(0 ,origin=0))
month day year
11 24 -4713
> chron:::julian.default(1,1,1970)
[1] 0
> chron:::julian.default(1,1,1970 ,origin=0)
[1] 2440588
> print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s"))) ,digits=17)
[1] 0.0000000000000000000 0.0006944444444444445
> print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s") ,origin=0)) ,digits=17)
[1] 2440588.000000000 2440588.000694444
> 0.0006944444444444445 *24*60
[1] 1
oracle> select to_char(to_date('4712-01-01 BC','YYYY-MM-DD BC'),'J') as julian from dual;
JULIAN
-------
0000001
oracle> select to_char(to_date('1970-01-01 AD','YYYY-MM-DD BC'),'J') as julian from dual;
JULIAN
-------
2440588
oracle> select to_char(to_date('1','J') ,'BC YYYY-MM-DD HH24:MI:SS') as origin from dual;
ORIGIN
----------------------
BC 4712-01-01 00:00:00
oracle> select to_char(to_date('2440588','J') ,'BC YYYY-MM-DD HH24:MI:SS') as epoch from dual;
EPOCH
----------------------
AD 1970-01-01 00:00:00
oracle> select to_char(to_date('4713-11-24 BC','YYYY-MM-DD BC'),'J') as julian from dual;
ERROR at line 1:
ORA-01841: (full) year must be between -4713 and +9999, and not be 0
# as.Date: as.Date.character uses strptime, and origin=0 has no effect.
> as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02")))
[1] -1 0 1
> as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02") ,origin=0))
[1] -1 0 1
# as.POSIXct: Default is local timezone, which happens to be EST -0500:
> as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))
[1] 0 60
> as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))
[1] 18000 18060
> 18000 /60^2
[1] 5
posixlt.to.tz.df <- function(xx) {
rr <- as.data.frame(unclass(xx))
rr[,"tzone"] <- NA
if (!is.na(match("tzone" ,names(attributes(xx)))))
rr[,"tzone"] <- attributes(xx)$tzone
rr
}
# as.POSIXlt: Is technically capable of expressing "timezone unknown":
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))
sec min hour mday mon year wday yday isdst tzone
1 0 0 0 1 0 70 4 0 0 GMT
2 0 1 0 1 0 70 4 0 0 GMT
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="EST"))
sec min hour mday mon year wday yday isdst tzone
1 0 0 0 1 0 70 4 0 0 EST
2 0 1 0 1 0 70 4 0 0 EST
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))
sec min hour mday mon year wday yday isdst tzone
1 0 0 0 1 0 70 4 0 0 NA
2 0 1 0 1 0 70 4 0 0 NA
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST"))
sec min hour mday mon year wday yday isdst tzone
1 0 0 0 1 0 70 4 0 0 PST
2 0 1 0 1 0 70 4 0 0 PST
> posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800"))
sec min hour mday mon year wday yday isdst tzone
1 0 0 0 1 0 70 4 0 0 -0800
2 0 1 0 1 0 70 4 0 0 -0800
# as.POSIXct is in fact implemented by using strptime and as.POSIXlt,
# but here show explicit conversion of lt to ct anyway:
> as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT")))
[1] 0 60
> as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00"))))
[1] 18000 18060
> 18000 /60^2
[1] 5
# BUG: Oddly, as.POSIXct() does NOT seem to respect non-GMT non-local
# timzones like PST:
> as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST")))
[1] 0 60
> as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800")))
[1] 0 60
# Note the (documented) subtlety due to GMT/UTC when converting
# between class "Date" and "POSIXct":
> as.POSIXct("2010-03-13 19:00:00 EST")
[1] "2010-03-13 19:00:00 EST"
> as.Date(as.POSIXct("2010-03-13 19:00:00 EST"))
[1] "2010-03-14"
> as.Date("2010-03-14"))
[1] "2010-03-14 UTC"
> as.POSIXct(as.Date("2010-03-14"))
[1] "2010-03-13 19:00:00 EST"
# timeDate appears to be much like POSIXct, counting seconds since the
# Unix epoch, but perhaps its underlying implementation is actually
# different, given its explicit use of a "units" attribute below:
> timeDate(0)
GMT
[1] [1970-01-01]
> timeDate("1970-01-01")
GMT
[1] [1970-01-01]
> as.numeric(timeDate("1970-01-01"))
[1] 0
attr(,"FinCenter")
[1] "GMT"
attr(,"units")
[1] "secs"
attr(,"origin")
[1] "1970-01-01 00:00:00 GMT"
> timeDate(2)
GMT
[1] [1970-01-01 00:00:02]
> as.numeric(timeDate(2))
[1] 2
attr(,"FinCenter")
[1] "GMT"
attr(,"units")
[1] "secs"
attr(,"origin")
[1] "1970-01-01 00:00:00 GMT"
> as.numeric(timeDate(24*60^2))
[1] 1
attr(,"FinCenter")
[1] "GMT"
attr(,"units")
[1] "days"
attr(,"origin")
[1] "1970-01-01 GMT"
------------------------------------------------------------
--
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/
_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only. If you want to post, subscribe first.
-- Also note that this is not the r-help list where general R questions should go.
This e-mail and any materials attached hereto, including, without limitation, all content hereof and thereof (collectively, "XR Content") are confidential and proprietary to XR Trading, LLC ("XR") and/or its affiliates, and are protected by intellectual property laws. Without the prior written consent of XR, the XR Content may not (i) be disclosed to any third party or (ii) be reproduced or otherwise used by anyone other than current employees of XR or its affiliates, on behalf of XR or its affiliates.
THE XR CONTENT IS PROVIDED AS IS, WITHOUT REPRESENTATIONS OR WARRANTIES OF ANY KIND. TO THE MAXIMUM EXTENT PERMISSIBLE UNDER APPLICABLE LAW, XR HEREBY DISCLAIMS ANY AND ALL WARRANTIES, EXPRESS AND IMPLIED, RELATING TO THE XR CONTENT, AND NEITHER XR NOR ANY OF ITS AFFILIATES SHALL IN ANY EVENT BE LIABLE FOR ANY DAMAGES OF ANY NATURE WHATSOEVER, INCLUDING, BUT NOT LIMITED TO, DIRECT, INDIRECT, CONSEQUENTIAL, SPECIAL AND PUNITIVE DAMAGES, LOSS OF PROFITS AND TRADING LOSSES, RESULTING FROM ANY PERSON'S USE OR RELIANCE UPON, OR INABILITY TO USE, ANY XR CONTENT, EVEN IF XR IS ADVISED OF THE POSSIBILITY OF SUCH DAMAGES OR IF SUCH DAMAGES WERE FORESEEABLE.
More information about the R-SIG-Finance
mailing list