AW: AW: [R] constructing specially ordered factor

Thomas Lumley tlumley at u.washington.edu
Tue Oct 5 16:38:09 CEST 2004


On Tue, 5 Oct 2004, Khamenia, Valery wrote:
> BTW, do you mean that current hash-based implementation brings *clearly*
> better performance than any O(n*log(n)) sort based algorithm?
> If I have correctly understood src/main/unique.c then current
> hash function is niether minimal perfect hash function nor even
> minimal hash function. In addition, as might be expected,
> current hash function uses full pass through the string to get
> a hash key.
>
> So, in particular, can anyone clearly show that the current
> hash-based algorithm will be quicker then sort-based
> algorithm if the input has:
>
> 1. a lot of strings;
> 2. strings are very long;
> 3. strings are quite unsimilar
>
> ?
>
> Hm, I don't believe you are ready to prove smth. like that.

Clearly the algorithm is not optimal for all possible sets of 
inputs (eg if the inputs are already known to be unique then an even 
faster implementation is just to do nothing).

As R's string comparison function use C's strcmp, for which the C standard 
makes no performance guarantees whatsoever, it is not possible to prove 
*anything* about the relative performance of algorithms without making 
some additional assumptions.

However, if this was a serious question, it is known that
a) in some versions of R on some byte-orderings the hash was broken and 
the performance was far inferior, suggesting that it is ordinarily quite 
effective.
b) Switching the rowsum function from a sort-based implementation to 
hashing produced a substantial speed increase.

 	-thomas




More information about the R-help mailing list