[Rd] R with Array Hashes

Jeffrey Horner jeffrey.horner at gmail.com
Fri Mar 6 16:19:18 CET 2015


Hi,

I wanted to share with the mailing list members here details about the
project I've been working on:

https://github.com/jeffreyhorner/R-Array-Hash

This is a re-implementation of R's hashed environments, the global
variable cache, the global string cache and symbol table with
cache-conscious array hash tables. The results are quite encouraging.
However, the implementation is a big departure from R's API:

"An array hash is a cache-conscious data structure that takes
advantage of hardware prefetchers for improved performance on large
hash tables, those large enough to fit in main memory and larger than
fast fixed size cpu caches.

However, their implementation is a radical departure from standard
chained hash tables. Rather than using chains of hash buckets for
collision resolution, array hashes use segements of contiguous memory
called dynamic arrays to store keys and values. Adding and deleting
items from the hash involve copying the entire segment to new areas in
memory. While this may seem wasteful and slow, it's surprisingly
efficient in both time and space.

In R, hashed environments are implemented using lists with each list
element (a CONS cell) acting as the hash bucket. The CONS cell is the
binding agent for a symbol and value. Hashed environments are searched
using the pointer address of the symbol rather than the symbol's
printed name.

R-Array-Hash takes advantage of this by implementing an integer array
hash to store addresses of symbols and their associated values. Care
is also taken to account for whether or not a binding is locked,
active, etc.

Similarly, R-Array-Hash re-implements R's string cache using a string
array hash. This introduces the most radical change to R's API: CHAR()
no longer returns an address that points to the area at the end of the
SEXP (containing the string value). Rather it returns an address
located in one of the contiguous dynamic arrays of the string hash
table. Therefore, care must be taken in C code to use the address
immediately since additions and deletions to the string hash could
render the result of CHAR() useless. There are many areas of the code
that sidestep this by calling translateChar(), which has been changed
to always copy the string pointed by CHAR()."

Comments, constructive or otherwise are welcome.

Best,

Jeff



More information about the R-devel mailing list