[Rd] shash in unique.c
Matthew Dowle
mdowle at mdowle.plus.com
Fri Mar 5 13:40:42 CET 2010
Thanks a lot. Quick and brief responses below...
"Duncan Murdoch" <murdoch at stats.uwo.ca> wrote in message
news:4B90F134.6070004 at stats.uwo.ca...
> Matthew Dowle wrote:
>> I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No
>> response resulted in a retry() after a Sys.sleep(10 days).
>>
>> If its a "yes" or "maybe" then I could proceed to try it, test it, and
>> present the test results and timings to you along with the patch. It
>> would be on 32bit Ubuntu first, and I would need to either buy, rent time
>> on, or borrow a 64bit machine to be able to then test there, owing to the
>> nature of the suggestion.
>>
>> If its "no", "bad idea because..." or "we were already working on it, or
>> better", then I won't spend any more time on it.
>>
>> Matthew
>>
>>
>> "Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message
>> news:hlu4qh$l7s$1 at dough.gmane.org...
>>
>>> Looking at shash in unique.c, from R-2.10.1 I'm wondering if it makes
>>> sense to hash the pointer itself rather than the string it points to?
>>> In other words could the SEXP pointer be cast to unsigned int and the
>>> usual scatter be called on that as if it were integer?
>>>
> Two negative but probably not fatal issues:
>
> Pointers and ints are not always the same size. In Win64, ints are 32
> bits, pointers are 64 bits. (Can we be sure there is some integer type
> the same size as a pointer? I don't know, ask a C expert.)
No we can't be sure. But we could test at runtime, and if the assumption
wasn't true, then revert to the existing method.
>
> We might want to save the hash to disk. On restore, the pointer based
> hash would be all wrong. (I don't know if we actually do ever save a hash
> to disk. )
The hash table in unique.c appears to be a temporary private hash, different
to the global R_StringHash. Its private hash appears to be used only while
the call to unique runs, then free'd. Thats my understanding anyway. The
suggestion is not to alter the global R_StringHash in any way at all, which
is the one that might be saved to disk now or in the future.
>
> Duncan Murdoch
>>> shash would look like a slightly modified version of ihash like this :
>>>
>>> static int shash(SEXP x, int indx, HashData *d)
>>> {
>>> if (STRING_ELT(x,indx) == NA_STRING) return 0;
>>> return scatter((unsigned int) (STRING_ELT(x,indx), d);
>>> }
>>>
>>> rather than its current form which appears to hash the string it points
>>> to :
>>>
>>> static int shash(SEXP x, int indx, HashData *d)
>>> {
>>> unsigned int k;
>>> const char *p;
>>> if(d->useUTF8)
>>> p = translateCharUTF8(STRING_ELT(x, indx));
>>> else
>>> p = translateChar(STRING_ELT(x, indx));
>>> k = 0;
>>> while (*p++)
>>> k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
>>> return scatter(k, d);
>>> }
>>>
>>> Looking at sequal, below, and reading its comments, if the pointers are
>>> equal it doesn't look at the strings they point to, which lead to the
>>> question above.
>>>
>>> static int sequal(SEXP x, int i, SEXP y, int j)
>>> {
>>> if (i < 0 || j < 0) return 0;
>>> /* Two strings which have the same address must be the same,
>>> so avoid looking at the contents */
>>> if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
>>> /* Then if either is NA the other cannot be */
>>> /* Once all CHARSXPs are cached, Seql will handle this */
>>> if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
>>> return 0;
>>> return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
>>> }
>>>
>>> Matthew
>>>
>>>
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
More information about the R-devel
mailing list