[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
Karl Ove Hufthammer
karl at huftis.org
Wed Jan 26 09:31:43 CET 2011
Simon Urbanek wrote:
>> I could *not* reproduce it; that is, ‘table’ is as fast on the non-ASCII
>> factor as it is on the ASCII factor.
>
> Strange - are you sure you get the right locale names? Make sure it's
> listed in locale -a.
Yes, I managed to reproduce it now, using a locale listed in ‘locale -a’.
There is a performance hit, though *much* smaller than on Windows.
> FWIW if you care about speed you should use tabulate() instead - it's much
> faster and incurs no penalty:
Yes, that’s the solution I ended up using:
res = tabulate(x, nbins=nlevels(x)) # nbins needed for levels that don’t occur
names(res) = levels(x)
res
(Though I’m not sure it’s *guaranteed* that factors are internally stored in a
way that makes this work, i.e., as the numbers 1, 2, ... for level 1, 2 ...)
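For illustration, here is a minimal sketch that wraps the snippet above in a
function and checks it against ‘table’ on a small made-up factor. The helper
name fast_table and the example data are hypothetical and not part of base R.
It relies on the assumption questioned above, namely that the integer codes
run from 1 to nlevels(x).

```r
# Hypothetical helper (not part of base R): count factor levels via tabulate().
fast_table <- function(x) {
  stopifnot(is.factor(x))
  # tabulate() counts the factor's integer codes; nbins keeps unused levels.
  res <- tabulate(x, nbins = nlevels(x))
  names(res) <- levels(x)
  res
}

# Made-up data: one non-ASCII level ("\u00e5" is a-ring) and one unused level "d".
x <- factor(c("\u00e5", "b", "\u00e5", "c"),
            levels = c("\u00e5", "b", "c", "d"))
counts <- fast_table(x)
print(counts)

# Cross-check against table(): same counts, same level names.
ref <- table(x)
stopifnot(all(counts == as.vector(ref)),
          identical(names(counts), names(ref)))
```

As far as ?factor describes it, a factor is a set of integer codes with a
"levels" attribute, which is what the sketch depends on. Note that the result
is a plain named integer vector, not an object of class "table".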
Anyway, do you think it’s worth trying to change the ‘table’ function the way I
outlined in my first post¹? This should eliminate the performance hit on all
platforms. However, it will introduce a performance hit (CPU and memory use)
if the elements of ‘exclude’ make up a large part of the factor(s).
¹ http://permalink.gmane.org/gmane.comp.lang.r.devel/26576
--
Karl Ove Hufthammer