[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
Karl Ove Hufthammer
karl at huftis.org
Fri Jan 21 10:47:56 CET 2011
[I originally posted this on the R-help mailing list, and it was suggested
that R-devel would be a better place to discuss it.]
Running ‘table’ on a factor with levels containing non-ASCII characters
seems to result in extremely bad performance on Windows. Here’s a simple
example with benchmark results (I’ve reduced the number of replications to
make the function finish within reasonable time):
library(rbenchmark)
x.num=sample(1:2, 10^5, replace=TRUE)
x.fac.ascii=factor(x.num, levels=1:2, labels=c("A","B"))
x.fac.nascii=factor(x.num, levels=1:2, labels=c("Æ","Ø"))
benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii), table(unclass(x.fac.nascii)), replications=20 )
                          test replications elapsed   relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
2           table(x.fac.ascii)           20    0.33   1.000000      0.33     0.00         NA        NA
3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA
sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252 LC_CTYPE=Norwegian-Nynorsk_Norway.1252 LC_MONETARY=Norwegian-Nynorsk_Norway.1252
[4] LC_NUMERIC=C LC_TIME=Norwegian-Nynorsk_Norway.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] rbenchmark_0.3
The timings are from R 2.12.1, but I also get comparable results
with the latest prerelease (R 2.13.0, 2011-01-18, r54032).
Running the same test (100 replications) on a Linux system with
R 2.12.1 Patched results in essentially no difference between the
performance on ASCII factors and non-ASCII factors:
                          test replications elapsed relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
2           table(x.fac.ascii)          100   1.488 1.000000     1.459    0.028          0         0
3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0
sessionInfo()
R version 2.12.1 Patched (2011-01-18 r54033)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=nn_NO.UTF-8 LC_NUMERIC=C LC_TIME=nn_NO.UTF-8
[4] LC_COLLATE=nn_NO.UTF-8 LC_MONETARY=C LC_MESSAGES=nn_NO.UTF-8
[7] LC_PAPER=nn_NO.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rbenchmark_0.3
Profiling the ‘table’ function indicates that almost all the time is spent
in the ‘match’ function, which is called when ‘factor’ is applied to a
‘factor’ inside ‘table’. Indeed, ‘x.fac.nascii = factor(x.fac.nascii)’ by
itself is extremely slow.
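For what it’s worth, the slowdown is easy to see even without a full
profile, by timing the factor-on-factor call and (roughly) the ‘match’
call it boils down to, and by running Rprof on ‘table’ itself (a rough
check only, using the objects defined above):

system.time(factor(x.fac.nascii))
system.time(match(as.character(x.fac.nascii), levels(x.fac.nascii)))  # roughly what factor() does internally
Rprof("table-profile.out")
invisible(table(x.fac.nascii))
Rprof(NULL)
head(summaryRprof("table-profile.out")$by.self)  # dominated by "match" here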
Is there any theoretical reason ‘factor’ on ‘factor’ with non-ASCII
characters must be so slow? And why doesn’t this happen on Linux?
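My only guess (and it really is just a guess) is that this is related to
encoding translation: on the Linux system the strings are already in
UTF-8, while on Windows they are in the native CP1252 encoding. If so,
re-marking the strings as UTF-8 on Windows should change the ‘match’
timings, e.g.:

y.native = as.character(x.fac.nascii)  # native (CP1252) encoding on Windows
y.utf8 = enc2utf8(y.native)            # the same strings, marked as UTF-8
system.time(match(y.native, unique(y.native)))
system.time(match(y.utf8, unique(y.utf8)))

I haven’t verified that this is actually the explanation.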
Perhaps a fix for ‘table’ might be to calculate the ‘table’ statistics
*including* all levels (not using the ‘factor’ function anywhere),
and then remove the ‘exclude’ levels at the end. For example,
something along these lines:
res = table.modified.to.not.use.factor(...)
ind = lapply(dimnames(res), function(x) !(x %in% exclude))
do.call("[", c(list(res), ind, drop=FALSE))
(I haven’t tested this very much, so there may be issues with this
way of doing things.)
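To make the idea a bit more concrete, here is a minimal one-dimensional
sketch (untested beyond toy cases; it ignores NA handling, dnn and the
multi-way case): it counts the integer codes directly with ‘tabulate’,
so neither ‘factor’ nor ‘match’ is ever called on the labels, and the
‘exclude’ levels are dropped at the end:

table1.nofactor = function(f, exclude=c(NA, NaN)) {
    counts = tabulate(unclass(f), nbins=nlevels(f))  # counts per level code
    names(counts) = levels(f)
    counts[!(names(counts) %in% exclude)]            # drop the 'exclude' levels
}
table1.nofactor(x.fac.nascii)  # same counts as table(x.fac.nascii)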
--
Karl Ove Hufthammer