[Rd] Efficiency of factor objects
mdowle at mdowle.plus.com
Mon Nov 7 18:48:29 CET 2011
Stavros Macrakis <macrakis <at> alum.mit.edu> writes:
> data.table certainly has some useful mechanisms, and I've been
> experimenting with it as an implementation mechanism, though it's not a
> drop-in substitute for factors. Also, though it is efficient for set
> operations between small sets and large sets, it is not very efficient for
> operations between two large sets
As a general statement, that could do with some clarification ;) data.table
likes keys consisting of multiple ordered columns, e.g. (id,date). It is (I
believe) efficient for joining two large keyed data sets with 2+ key columns,
because in that case the upper bound of each row's one-sided binary search is
localised (to the group defined by the previous key column).
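To make that concrete, here is a minimal sketch of the kind of join I mean.
The tables X and Y, their sizes and the columns x and y are made up purely
for illustration; only setkey() and the X[Y] join syntax are data.table's
own. It shows the shape of the problem, not a benchmark.

```r
library(data.table)

set.seed(1)
n_id   <- 1000L   # hypothetical number of ids
n_date <- 100L    # hypothetical number of dates per id

# Two large tables with heavily duplicated values in the first key column
X <- data.table(id   = rep(seq_len(n_id), each = n_date),
                date = rep(seq_len(n_date), times = n_id),
                x    = rnorm(n_id * n_date))
Y <- data.table(id   = rep(seq_len(n_id), each = n_date),
                date = rep(seq_len(n_date), times = n_id),
                y    = rnorm(n_id * n_date))

# Sort both tables by the two-column key (id, date)
setkey(X, id, date)
setkey(Y, id, date)

# Keyed join: each row of Y is looked up in X by (id, date)
Z <- X[Y]
```

The point is that the key has two columns and the first one (id) repeats many
times, which is quite unlike a single vector of unique strings.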
As I understand it, Stavros has a different kind of 'two large datasets':
English language website data. Each set is one large vector of uniformly
distributed unique strings. That appears to be quite a different problem from
joining multiple columns of heavily duplicated data.
> Thanks everyone, and if you do come across a relevant CRAN package, I'd be
> very interested in hearing about it.