[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
Simon Urbanek
simon.urbanek at r-project.org
Tue Jan 25 20:14:56 CET 2011
On Jan 25, 2011, at 5:49 AM, Karl Ove Hufthammer wrote:
> Matthew Dowle wrote:
>
>> I'm not sure, but note the difference in locale between
>> Linux (UTF-8) and Windows (non UTF-8). As far as I
>> understand it R much prefers UTF-8, which Windows doesn't
>> natively support. Otherwise you could just change your
>> Windows locale to a UTF-8 locale to make R happier.
>>
> [...]
>>
>> If anybody knows a way to trick R on Linux into thinking it has
>> an encoding similar to Windows then I may be able to take a
>> look if I can reproduce the problem in Linux.
>
> Changing the locale to an ISO 8859-1 locale, i.e.:
>
> export LC_ALL="en_US.ISO-8859-1"
> export LANG="en_US.ISO-8859-1"
>
> I could *not* reproduce it; that is, ‘table’ is as fast on the non-ASCII
> factor as it is on the ASCII factor.
>
Strange - are you sure you got the right locale names? Make sure the locale is listed in the output of locale -a. The above works on my Mac, but on my Linux system I have to use LANG=en_US.iso88591, and it *is* replicable, albeit with a much smaller hit:
> benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii), table(unclass(x.fac.nascii)), replications=20 )
                          test replications elapsed relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))           20   1.028 2.269316     1.020    0.004          0         0
2           table(x.fac.ascii)           20   0.453 1.000000     0.452    0.004          0         0
3          table(x.fac.nascii)           20   2.683 5.922737     2.684    0.000          0         0
1                 table(x.num)           20   1.028 2.269316     1.020    0.008          0         0
The main reason is that table() calls factor(), which calls as.character(), which means 10^5 character conversions - a bad idea in this case. Why the penalty is so much higher on Windows I can't answer at the moment, as I'm not on a machine with a Windows VM.
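To illustrate the conversion step, here is a minimal sketch with a made-up four-element factor (not the objects from this thread): as.character() produces one string per element, while unclass() exposes the integer codes that are already stored.

```r
# Hypothetical toy factor with non-ASCII levels (illustration only).
f <- factor(c("\u00e6", "\u00f8", "\u00e5", "\u00f8"))

# Integer codes as stored in the factor; no strings are touched.
codes <- unclass(f)

# One character string per element, looked up from the levels.
chars <- as.character(f)
```

With 10^5 elements, the second form is what gets built before the values are matched against the levels.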
FWIW if you care about speed you should use tabulate() instead - it's much faster and incurs no penalty:
> benchmark( tabulate(x.num), tabulate(x.fac.ascii), tabulate(x.fac.nascii), tabulate(unclass(x.fac.nascii)), replications=20 )
                             test replications elapsed relative user.self sys.self user.child sys.child
4 tabulate(unclass(x.fac.nascii))           20   0.027 1.421053     0.024        0          0         0
2           tabulate(x.fac.ascii)           20   0.023 1.210526     0.024        0          0         0
3          tabulate(x.fac.nascii)           20   0.024 1.263158     0.020        0          0         0
1                 tabulate(x.num)           20   0.019 1.000000     0.020        0          0         0
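Note that tabulate() returns a bare integer vector without names. A sketch of getting labelled counts back follows; the data here is an assumption (the x.num / x.fac.* objects from earlier in the thread are not reproduced), and the 50 levels and their labels are invented for illustration.

```r
# Hypothetical data standing in for the thread's test objects.
set.seed(1)
x.num <- sample(1:50, 1e5, replace = TRUE)
lev <- paste0("\u00e6\u00f8\u00e5", 1:50)   # assumed non-ASCII level labels
x.fac.nascii <- factor(x.num, levels = 1:50, labels = lev)

# Count the integer codes directly; passing nbins keeps empty levels.
counts <- tabulate(x.fac.nascii, nbins = nlevels(x.fac.nascii))

# Attach the level labels once, instead of converting every element.
names(counts) <- levels(x.fac.nascii)
```

The counts should agree with as.vector(table(x.fac.nascii)), but NA handling and multi-way tables are not covered by this shortcut.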
Cheers,
Simon