[R-pkgs] data.table is on CRAN (enhanced data.frame for time series joins and more)

Tue Mar 31 03:37:58 CEST 2009

Dear all,

The data.table package was released back in August 2008. This email is to 
publicise its existence in response to several suggestions to do so. It 
seems I didn't send a general announcement at the time, so, perhaps not 
surprisingly, not many people know about it. A glance at some recent r-help 
threads supports the idea of sending a public announcement.

The main difference between data.frame and data.table is the enhanced 
functionality in [.data.table, which is also where most documentation for 
this package lives, i.e. help("[.data.table"). Selected extracts from the 
package documentation follow.

The package builds on base R functionality to reduce two types of time:
   1. programming time (easier to write, read, debug and maintain)
   2. compute time
when combining database-like operations (subset, with and by). It also 
provides joins similar to those merge provides, but faster. This is achieved 
by using R's column-based, ordered, in-memory data.frame, eval within the 
environment of a list (i.e. with), the [.data.table mechanism to condense 
the features, and compiled C to make certain operations fast.
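
To illustrate the "eval within the environment of a list" idea, here is a 
minimal sketch in base R only. It shows the mechanism that with() relies on; 
it is not the package's internal code, and the names expr, r1 and r2 are 
made up for the example.

```r
# A data.frame is a list of columns, so a captured expression of column
# names can be evaluated with those columns acting as the variables.
DF = data.frame(a=1:5, b=6:10)
expr = quote(a+b+1)      # captured, not yet evaluated
r1 = eval(expr, DF)      # evaluate inside the list of columns
r2 = with(DF, a+b+1)     # with() does the same thing
identical(r1, r2)        # TRUE
```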

[.data.table is like [.data.frame but i and j can be expressions of column 
names directly. Furthermore, i may itself be a data.table, which invokes a 
fast table join using binary search in O(log n) time. Allowing i to be a 
data.table is consistent with subsetting an n-dimension array by an n-column 
matrix in base R. data.tables do not have rownames but may instead have a 
key of one or more columns, set using setkey. This key may be used for row 
indexing instead of rownames.
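
For anyone unfamiliar with the base R behaviour referred to above, a small 
sketch of subsetting an array by a matrix follows. It uses base R only; the 
names m, idx and res are made up for the example. Each row of the index 
matrix names one cell to look up, much as each row of a data.table passed 
as i names one key value to look up.

```r
# A 3 x 4 matrix is a 2-dimension array, so it can be indexed by a 2-column matrix.
m = matrix(1:12, nrow=3)
# Each row of idx is one (row, column) coordinate.
idx = cbind(c(1,3), c(2,4))
res = m[idx]     # the elements m[1,2] and m[3,4]
res              # 4 12
```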

Examples comparing [.data.frame and [.data.table :

DF = data.frame(a=1:5, b=6:10)
DT = data.table(a=1:5, b=6:10)

tt = subset(DF,a==3)
ss = DT[a==3]        # just use the column name 'a' directly. No need to
                     # remember the comma. The i argument is like the
                     # 'where' in SQL.
identical(as.data.table(tt), ss)

tt = with(subset(DF,a==3),a+b+1)
ss = DT[a==3,a+b+1]        # j is like select in SQL and the select argument
                           # of subset in base R. j can be an expression of
                           # column names directly, including a data.table of
                           # multiple expressions. Here the j expression is
                           # executed just for the rows matching the i argument.
identical(tt, ss)

# Examples above use vector scans i.e. the "a==3" expression is evaluated for
# every row, creating a logical vector as long as the total number of rows.
# Examples below use binary search, invoked by passing in a data.table as
# the i argument. Joins in SQL are performed in the where clause and the i
# argument is like where, so this seems very natural (to me anyway!)

DT = data.table(a=letters[1:5], b=6:10)
setkey(DT,a)
identical(DT[J("d")], DT[4])        # binary search to row for 'd'
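
As a rough illustration of why a sorted key helps, the sketch below 
contrasts a vector scan with a hand-written binary search in base R. The 
bsearch function is a hypothetical helper written for this example; it is 
not part of data.table, whose search is done in compiled C.

```r
# Hypothetical helper (not part of data.table): binary search for x in a
# sorted vector, returning its position or NA if it is not present.
bsearch = function(x, key) {
    lo = 1L
    hi = length(key)
    while (lo <= hi) {
        mid = (lo + hi) %/% 2L
        if (key[mid] == x) return(mid)
        if (key[mid] < x) lo = mid + 1L else hi = mid - 1L
    }
    NA_integer_
}

key = letters[1:5]          # already sorted, like a keyed column
which(key == "d")           # vector scan: compares every element
bsearch("d", key)           # binary search: halves the range at each step
```

Both calls return position 4 here; the scan does work proportional to the 
number of rows, while the search needs roughly log2(n) comparisons.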

DT = data.table(id=rep(c("A","B"),each=3), 
date=c(20080501L,20080502L,20080506L), v=1:6)
setkey(DT,id,date)
DT["A"]                           # all 3 rows for A since mult by default is "all"
DT[J("A",20080502L)]              # row for A where date also matches exactly
DT[J("A",20080505L)]              # NA since 5 May is missing (outer join by default)
DT[J("A",20080505L),nomatch=0]    # inner join instead
dts = c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)
DT[J("A",dts)]                    # 3 of the dates in dts match exactly
DT[J("A",dts),roll=TRUE]          # roll previous data forward i.e. return the
                                  # prevailing observation
DT[J("A",dts),rolltolast=TRUE]    # roll all but last observation forward
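
To make "return the prevailing observation" concrete, here is a minimal 
sketch of the same idea in base R using findInterval. It only mimics the 
roll=TRUE result for the A rows above and says nothing about how the package 
does it internally; the names obs, idx and rolled are made up for the 
example.

```r
obs = c(20080501L, 20080502L, 20080506L)   # sorted observation dates for A
v   = 1:3                                  # value observed on each date
dts = c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)

# Position of the last observation on or before each requested date
# (0 when the date falls before the first observation).
idx = findInterval(dts, obs)
idx[idx == 0L] = NA        # nothing to roll forward from
rolled = v[idx]
rolled                     # 1 2 2 3 3 3
```

The value for 5 May is carried forward from 2 May, and the dates after 
6 May carry the 6 May value forward.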

tables(mb=TRUE)   # prints table names, number of rows, size in memory

Thanks to all those who have offered suggestions and feedback so far. Further 
comments and feedback on the package would be much appreciated.

Regards, Matthew