[R-pkgs] data.table is on CRAN (enhanced data.frame for time series joins and more)
Matthew Dowle
mdowle at mdowle.plus.com
Tue Mar 31 03:37:58 CEST 2009
Dear all,
The data.table package was released in August 2008. This email is to
publicise its existence, in response to several suggestions to do so. It
seems I didn't send a general announcement at the time, so perhaps not
surprisingly few people know about it. Some recent r-help threads also
suggest that a public announcement would be useful.
The main difference between data.frame and data.table is the enhanced
functionality in [.data.table, which is also where most of the documentation
for this package lives, i.e. help("[.data.table"). Selected extracts from the
package documentation follow.
The package builds on base R functionality to reduce two types of time:
1. programming time (easier to write, read, debug and maintain)
2. compute time
when combining database-like operations (subset, with and by). It also
provides joins similar to those of merge, but faster. This is achieved by
using R's column-based, ordered, in-memory data.frame; eval within the
environment of a list (i.e. with); the [.data.table mechanism to condense
the features; and compiled C to make certain operations fast.
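
As a rough illustration of the "eval within the environment of a list" idea
above, here is a minimal base R sketch. It is not the package's
implementation; the helper name eval_in_columns is hypothetical and only
shows how an unevaluated expression of column names can be evaluated against
a list of columns, optionally restricted to some rows.

```r
# Hypothetical helper (not part of data.table): evaluate an expression of
# column names against a list of columns, optionally for a subset of rows.
eval_in_columns <- function(cols, expr, rows = NULL) {
  if (!is.null(rows)) {
    # Keep only the requested rows of every column
    cols <- lapply(cols, function(x) x[rows])
  }
  # eval() looks up names in 'cols' first, then in the calling environment
  eval(expr, cols, parent.frame())
}

cols <- list(a = 1:5, b = 6:10)

# Same idea as with(cols, a + b + 1)
all_rows <- eval_in_columns(cols, quote(a + b + 1))

# Restrict to the rows where a == 3 before evaluating the expression
one_row <- eval_in_columns(cols, quote(a + b + 1), rows = which(cols$a == 3))
```

Subsetting the columns first and evaluating afterwards means the expression
only ever sees the matching rows.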
[.data.table is like [.data.frame, but i and j can be expressions of column
names directly. Furthermore, i may itself be a data.table, which invokes a
fast table join using binary search in O(log n) time. Allowing i to be a
data.table is consistent with subsetting an n-dimensional array by an
n-column matrix in base R. data.tables do not have rownames but may instead
have a key of one or more columns, set using setkey. This key may be used
for row indexing instead of rownames.
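
For readers unfamiliar with the base R behaviour referred to above, this
small sketch shows a 2-dimensional array being subset by a 2-column matrix.
The object names (m, idx, vals) are made up for illustration; only base R is
used.

```r
# A 2-dimensional array (matrix) with named dimensions
m <- matrix(1:6, nrow = 2, dimnames = list(c("A", "B"), c("x", "y", "z")))

# A 2-column matrix: each row holds one (row, column) coordinate to look up
idx <- cbind(c("A", "B", "A"), c("x", "z", "y"))

# One value is returned per row of idx
vals <- m[idx]
```

Each row of idx picks out one element of m, in the same way that each row of
a data.table passed as i picks out the matching key values.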
Examples comparing [.data.frame and [.data.table:
library(data.table)
DF = data.frame(a=1:5, b=6:10)
DT = data.table(a=1:5, b=6:10)
tt = subset(DF,a==3)
# Just use the column name 'a' directly. No need to remember the comma.
# The i argument is like the 'where' in SQL.
ss = DT[a==3]
identical(as.data.table(tt), ss)
tt = with(subset(DF,a==3),a+b+1)
# j is like select in SQL and the select argument of subset in base R.
# j can be an expression of column names directly, including a data.table
# of multiple expressions. Here the j expression is executed just for the
# rows matching the i argument.
ss = DT[a==3,a+b+1]
identical(tt, ss)
# Examples above use vector scans, i.e. the "a==3" expression is evaluated
# for every row, creating a logical vector as long as the total number of
# rows.
# Examples below use binary search, invoked by passing in a data.table as
# the i argument. Joins in SQL are performed in the where clause and the i
# argument is like where, so this seems very natural (to me anyway!)
DT = data.table(a=letters[1:5], b=6:10)
setkey(DT,a)
identical(DT[J("d")], DT[4]) # binary search to row for 'd'
DT = data.table(id=rep(c("A","B"),each=3),
date=c(20080501L,20080502L,20080506L), v=1:6)
setkey(DT,id,date)
# all 3 rows for A since mult by default is "all"
DT["A"]
# row for A where date also matches exactly
DT[J("A",20080502L)]
# NA since 5 May is missing (outer join by default)
DT[J("A",20080505L)]
# inner join instead
DT[J("A",20080505L),nomatch=0]
dts = c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)
# 3 of the dates in dts match exactly
DT[J("A",dts)]
# roll previous data forward i.e. return the prevailing observation
DT[J("A",dts),roll=TRUE]
# roll all but last observation forward
DT[J("A",dts),rolltolast=TRUE]
# prints table names, number of rows, size in memory
tables(mb=TRUE)
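
To make the difference between an exact match and rolling the previous
observation forward more concrete, here is a base R sketch that does not use
data.table at all. It assumes the dates are sorted in increasing order and
uses findInterval, which searches a sorted vector, to pick the prevailing
observation. The helper name roll_lookup is hypothetical, and the package's
own implementation may well differ.

```r
# Hypothetical helper (not part of data.table): for each requested date,
# return the value observed on that date or, failing that, the most recent
# earlier observation. 'dates' must be sorted in increasing order.
roll_lookup <- function(dates, values, wanted) {
  # Index of the last element of 'dates' that is <= each wanted date;
  # 0 means the wanted date precedes every observation.
  pos <- findInterval(wanted, dates)
  pos[pos == 0] <- NA
  values[pos]
}

dates  <- c(20080501L, 20080502L, 20080506L)  # the dates for id "A"
values <- 1:3                                 # the matching v column
dts <- c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)

exact  <- values[match(dts, dates)]        # exact matches only, NA otherwise
rolled <- roll_lookup(dates, values, dts)  # prevailing observation
```

Here exact has NA wherever a date in dts is missing from dates, while rolled
fills those gaps with the last value observed on or before that date.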
Thanks to all those who have offered suggestions and feedback so far.
Further comments and feedback on the package would be much appreciated.
Regards, Matthew