[Rd] Consider increasing the size of HSIZE

Jim Hester james.f.hester at gmail.com
Tue May 16 20:02:08 CEST 2017


The HSIZE constant, which sets the size of the hash table used to
store symbols is currently defined as `#define HSIZE 4119`. This value
was last increased in r5182 on 1999-07-15.

https://github.com/jimhester/hashsize#readme contains a code which
simulates a normal R workflow by loading a handful of packages. In the
example more than 20,000 symbols are included in the hash table,
resulting in a load factor of greater than 5. The histogram in the
linked repository shows the distribution of bucket sizes for the hash
table.

This high load factor means most queries into the hashtable result in
a collision, requiring an additional linear search of the linked list
for each bucket. Is is common for growable hash tables to increase
their size when the load factor is greater than .75, so I think it
would be of benefit to increase the HSIZE constant considerably; to
32768 or possibly 65536. This will result in increased memory
requirements for the hash table, but far fewer collisions.

To get an idea of the performance implications the repository includes
some benchmarks of looking up the first element in a given hash
bucket, and the last element (for buckets over 10 elements long). The
results are somewhat noisy. Because longer symbol names hashing the
name and performing string comparisons to searching the list tends to
dominate the time. But for symbols of similar length there is a 2X-4X
increase in lookup performance between retrieving the first element in
a bucket to retrieving the last (indicated by the `total` column in
the table).

Increasing the size of `HSIZE` seems like a easy way to improve the
performance of an operation that occurs thousands if not millions of
times for every R session, with very limited cost in memory.



More information about the R-devel mailing list