[Rd] Efficiency of factor objects

Mon Nov 7 18:48:29 CET 2011

Stavros Macrakis <macrakis <at> alum.mit.edu> writes:
> 
> data.table certainly has some useful mechanisms, and I've been
> experimenting with it as an implementation mechanism, though it's not a
> drop-in substitute for factors.  Also, though it is efficient for set
> operations between small sets and large sets, it is not very efficient for
> operations between two large sets

As a general statement, that could do with some clarification ;) data.table 
likes keys consisting of multiple ordered columns, e.g. (id,date). It is (I 
believe) efficient for joining two large keyed data sets with 2+ key columns, 
because in that case each row's one-sided binary search has a localised upper 
bound: it only needs to search within the current group of the previous key 
column.
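
To make the "localised upper bound" idea concrete, here is a rough sketch in 
Python. It is not data.table's actual implementation, only an illustration of 
the principle. The function name keyed_join and the toy (id, date) data are 
made up for this example, and both inputs are assumed to be already sorted by 
(id, date), with no duplicate keys in the first one.

```python
from bisect import bisect_left, bisect_right


def keyed_join(x_keys, i_keys):
    """For each (id, date) in i_keys, return its row index in x_keys, or None.

    Illustrative sketch only. Assumes both lists are sorted by (id, date)
    and that x_keys holds no duplicate keys.
    """
    ids = [k[0] for k in x_keys]
    dates = [k[1] for k in x_keys]
    out = []
    lo = 0                  # lower bound carried over from the previous row (one-sided)
    grp_id, grp_hi = None, 0
    for id_, date in i_keys:
        if id_ != grp_id:
            # New id: locate its block of rows once. Starting at lo is safe
            # because i_keys is sorted, so this id cannot occur earlier.
            lo = bisect_left(ids, id_, lo)
            grp_hi = bisect_right(ids, id_, lo)
            grp_id = id_
        # The date search is confined to [lo, grp_hi): the upper bound is
        # localised to the current id group rather than the whole table.
        pos = bisect_left(dates, date, lo, grp_hi)
        out.append(pos if pos < grp_hi and dates[pos] == date else None)
        lo = pos            # later rows in this group cannot match before pos
    return out


x = [("a", "2011-01-01"), ("a", "2011-01-02"),
     ("b", "2011-01-01"), ("b", "2011-01-03"),
     ("c", "2011-01-02")]
i = [("a", "2011-01-02"), ("b", "2011-01-02"),
     ("b", "2011-01-03"), ("d", "2011-01-01")]
print(keyed_join(x, i))  # [1, None, 3, None]
```

With a single key column of unique values there is no previous key column to 
narrow the range, so each search can only rely on the lower bound carried 
over from the previous row.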

As I understand it, Stavros has a different type of 'two large datasets': 
English language website data. Each set is one large vector of uniformly 
distributed unique strings. That appears to be quite a different problem from 
joining multiple columns of heavily duplicated data.

Matthew

> Thanks everyone, and if you do come across a relevant CRAN package, I'd be
> very interested in hearing about it.
> 
>           -s
>