[R] Dynamic Dictionary Data Type?
Duncan Murdoch
murdoch at stats.uwo.ca
Thu Jun 2 21:37:29 CEST 2005
On 6/2/2005 2:40 PM, Prof Brian Ripley wrote:
> On Thu, 2 Jun 2005, Duncan Murdoch wrote:
>
>> Gabor Grothendieck wrote:
>>> On 6/2/05, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>>>
>>>> On Thu, 2 Jun 2005, hadley wickham wrote:
>>>>
>>>>
>>>>>> An environment is a hash table, and copying occurs only on modification,
>>>>>> when any language would have to copy in this context.
>>>>
>>>> Caveat: the default is new.env(hash=FALSE), so an environment is a hash
>>>> table in one sense but not necessarily so in the strict sense.
>>>
>>>
>>> Can you expand on this? When would one use hash = TRUE vs.
>>> the default?
>>
>> It's an optimization question. hash = TRUE uses more memory and takes longer
>> to set up, but will give faster searches in big environments. The current
>> default assumes that environments are small so the effort of building the
>> hash table is not worthwhile, and it does linear searches.
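(To make that concrete, here is a minimal sketch; the size argument is
just an illustration, not a recommendation:)

e1 <- new.env(hash=FALSE)             # the default: linear name search
e2 <- new.env(hash=TRUE, size=1000L)  # hash table with preallocated slots
assign("x", 1, envir=e1)
assign("x", 1, envir=e2)
get("x", envir=e1)   # the semantics are identical either way;
get("x", envir=e2)   # only the cost of the lookup differs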
>
> It's not really size: building small hash tables is quick. The issue has
> more to do with how many lookups are done relative to the number of
> entries.
>
> We met the same issues for a named vector a while back. The relevant NEWS
> item was
>
> o Indexing a vector by a character vector was slow if both the
> vector and index were long (say 10,000). Now hashing is used
> and the time should be linear in the longer of the lengths
> (but more memory is used).
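(For reference, the named-vector case looks like this; the sizes here are
just illustrative:)

x <- 1:10000
names(x) <- as.character(x)
idx <- sample(names(x), 10000, replace=TRUE)
system.time(x[idx], gcFirst=TRUE)   # hashed name lookup, ~linear in length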
>
>
>> I suspect that we might have the default wrong (or perhaps should make the
>> transition from linear to hash search automatically at a certain threshold
>> size), but I haven't done the testing necessary to show this.
>
> Here's an example
>
> tr <- as.character(trunc(runif(1e5, max=100)))
> system.time({
>     env <- new.env(hash=F)
>     for(i in 1:1e5) assign(tr[i], i, envir=env)
> }, gcFirst=TRUE)
>
> which runs about 5% faster with hashing. Now change to max=1e4: the hashed
> version takes about 50% longer than before, and the unhashed one about
> 120x as long.
>
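(For completeness, the hashed counterpart of that snippet, which isn't
shown above, would be:)

tr <- as.character(trunc(runif(1e5, max=100)))
system.time({
    env <- new.env(hash=TRUE)
    for (i in 1:1e5) assign(tr[i], i, envir=env)
}, gcFirst=TRUE)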
Brian's example tests lots of lookups in fairly big tables. I just ran the
following code to check relatively few lookups in small tables:
result <- matrix(NA, 10, 10)

timeit <- function(lookups, entries, reps = 1000) {
    tr <- as.character(trunc(runif(lookups, max = entries)))
    hash <- system.time({
        for (n in 1:reps) {
            env <- new.env(hash = TRUE)
            for (i in 1:lookups) assign(tr[i], i, envir = env)
        }
    }, gcFirst = TRUE)[1]
    nohash <- system.time({
        for (n in 1:reps) {
            env <- new.env(hash = FALSE)
            for (i in 1:lookups) assign(tr[i], i, envir = env)
        }
    }, gcFirst = TRUE)[1]
    hash / nohash    # ratio of user times, hashed vs. unhashed
}

for (i in 1:10) for (j in 1:10) result[i, j] <- timeit(2^i, 2^j)
This gives the ratio of hashed to non-hashed times for 2^i lookups from 2^j
possible names. The results were as follows:
> round(result, 3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.000 0.333 1.000 1.000 1.000 1.000 0.500 3.000 1.000 0.500
[2,] 1.000 1.000 0.750 1.000 1.000 1.333 1.000 1.000 1.000 1.000
[3,] 0.800 0.600 1.000 1.000 0.800 1.000 0.800 1.000 1.000 1.000
[4,] 1.000 0.800 1.143 1.125 0.889 1.000 1.000 1.000 1.000 0.700
[5,] 1.067 0.938 1.143 1.000 0.941 1.000 1.063 1.067 0.938 1.143
[6,] 1.000 1.034 0.966 0.838 1.000 1.031 1.000 1.000 1.091 1.065
[7,] 1.000 1.000 0.891 0.952 0.984 0.970 0.957 0.960 1.041 0.986
[8,] 1.009 0.992 0.992 0.984 0.904 0.928 0.906 0.905 0.900 0.898
[9,] 1.040 1.008 0.967 0.972 0.957 0.922 0.884 0.778 0.810 0.789
[10,] 1.022 0.998 1.002 0.976 0.930 0.886 0.799 0.759 0.723 0.657
So hashing in small tables doesn't look much slower than linear lookup
(maybe 5%? I need more reps...), even taking the hash table creation into
account. I think we should switch to hash=TRUE as our default.
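A natural follow-up (just a sketch; I haven't run it at scale) would be to
time lookups alone once the table is built, since the ratios above fold in
the cost of filling the environment:

tr <- as.character(trunc(runif(1e4, max=100)))
env <- new.env(hash=TRUE)
for (i in seq_along(tr)) assign(tr[i], i, envir=env)
system.time(for (i in seq_along(tr)) get(tr[i], envir=env), gcFirst=TRUE)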
Duncan Murdoch