[R] Dynamic Dictionary Data Type?
Duncan Murdoch
murdoch at stats.uwo.ca
Thu Jun 2 21:37:29 CEST 2005
On 6/2/2005 2:40 PM, Prof Brian Ripley wrote:
> On Thu, 2 Jun 2005, Duncan Murdoch wrote:
>
>> Gabor Grothendieck wrote:
>>> On 6/2/05, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>>>
>>>> On Thu, 2 Jun 2005, hadley wickham wrote:
>>>>
>>>>
>>>>>> An environment is a hash table, and copying occurs only on modification,
>>>>>> when any language would have to copy in this context.
>>>>
>>>> Caveat: the default is new.env(hash=FALSE), so an environment is a hash
>>>> table in one sense but not necessarily so in the strict sense.
>>>
>>>
>>> Can you expand on this? When would one use hash = TRUE vs.
>>> the default?
>>
>> It's an optimization question. hash = TRUE uses more memory and takes longer
>> to set up, but will give faster searches in big environments. The current
>> default assumes that environments are small so the effort of building the
>> hash table is not worthwhile, and it does linear searches.
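(To make that concrete, here is a minimal sketch; the size argument is
just an illustration, not a recommendation:)

e1 <- new.env(hash=FALSE)             # the default: linear name search
e2 <- new.env(hash=TRUE, size=1000L)  # hash table with preallocated slots
assign("x", 1, envir=e1)
assign("x", 1, envir=e2)
get("x", envir=e1)   # the semantics are identical either way;
get("x", envir=e2)   # only the cost of the lookup differs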
>
> It's not really size: building small hash tables is quick. The issue has
> more to do with how many lookups are done relative to the number of
> entries.
>
> We met the same issues for a named vector a while back. The relevant NEWS
> item was
>
> o Indexing a vector by a character vector was slow if both the
> vector and index were long (say 10,000). Now hashing is used
> and the time should be linear in the longer of the lengths
> (but more memory is used).
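(For reference, the named-vector case looks like this; the sizes here are
just illustrative:)

x <- 1:10000
names(x) <- as.character(x)
idx <- sample(names(x), 10000, replace=TRUE)
system.time(x[idx], gcFirst=TRUE)   # hashed name lookup, ~linear in length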
>
>
>> I suspect that we might have the default wrong (or perhaps should make the
>> transition from linear to hash search automatically at a certain threshold
>> size), but I haven't done the testing necessary to show this.
>
> Here's an example
>
> tr <- as.character(trunc(runif(1e5, max=100)))
> system.time({
>     env <- new.env(hash=F)
>     for(i in 1:1e5) assign(tr[i], i, envir=env)
> }, gcFirst=TRUE)
>
> which runs about 5% faster with hashing. Now change to max=1e4: the hashed
> version takes about 50% longer than before, and the unhashed one about
> 120x as long.
>
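(For completeness, the hashed counterpart of that snippet, which isn't
shown above, would be:)

tr <- as.character(trunc(runif(1e5, max=100)))
system.time({
    env <- new.env(hash=TRUE)
    for (i in 1:1e5) assign(tr[i], i, envir=env)
}, gcFirst=TRUE)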
Brian's example tests lots of lookups in fairly big tables. I just ran the
following code to check relatively few lookups in small tables:
result <- matrix(NA, 10, 10)

timeit <- function(lookups, entries, reps = 1000) {
    tr <- as.character(trunc(runif(lookups, max = entries)))
    hash <- system.time({
        for (n in 1:reps) {
            env <- new.env(hash = TRUE)
            for (i in 1:lookups) assign(tr[i], i, envir = env)
        }
    }, gcFirst = TRUE)[1]
    nohash <- system.time({
        for (n in 1:reps) {
            env <- new.env(hash = FALSE)
            for (i in 1:lookups) assign(tr[i], i, envir = env)
        }
    }, gcFirst = TRUE)[1]
    hash / nohash    # ratio of user times, hashed vs. unhashed
}

for (i in 1:10) for (j in 1:10) result[i, j] <- timeit(2^i, 2^j)
This gives the ratio of hashed to non-hashed times for 2^i lookups from 2^j
possible names. The results were as follows:
> round(result, 3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.000 0.333 1.000 1.000 1.000 1.000 0.500 3.000 1.000 0.500
[2,] 1.000 1.000 0.750 1.000 1.000 1.333 1.000 1.000 1.000 1.000
[3,] 0.800 0.600 1.000 1.000 0.800 1.000 0.800 1.000 1.000 1.000
[4,] 1.000 0.800 1.143 1.125 0.889 1.000 1.000 1.000 1.000 0.700
[5,] 1.067 0.938 1.143 1.000 0.941 1.000 1.063 1.067 0.938 1.143
[6,] 1.000 1.034 0.966 0.838 1.000 1.031 1.000 1.000 1.091 1.065
[7,] 1.000 1.000 0.891 0.952 0.984 0.970 0.957 0.960 1.041 0.986
[8,] 1.009 0.992 0.992 0.984 0.904 0.928 0.906 0.905 0.900 0.898
[9,] 1.040 1.008 0.967 0.972 0.957 0.922 0.884 0.778 0.810 0.789
[10,] 1.022 0.998 1.002 0.976 0.930 0.886 0.799 0.759 0.723 0.657
So hashing in small tables doesn't look much slower than linear lookup
(maybe 5%? I need more reps...), even taking the hash table creation into
account. I think we should switch to hash=TRUE as our default.
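A natural follow-up (just a sketch; I haven't run it at scale) would be to
time lookups alone once the table is built, since the ratios above fold in
the cost of filling the environment:

tr <- as.character(trunc(runif(1e4, max=100)))
env <- new.env(hash=TRUE)
for (i in seq_along(tr)) assign(tr[i], i, envir=env)
system.time(for (i in seq_along(tr)) get(tr[i], envir=env), gcFirst=TRUE)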
Duncan Murdoch