[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
Karl Ove Hufthammer
karl at huftis.org
Wed Jan 26 14:37:08 CET 2011
Karl Ove Hufthammer wrote:
> Anyway, do you think it’s worth trying to change the ‘table’ function the
> way I outlined in my first post¹? This should eliminate the performance
> hit on all platforms.
Some additional notes: ‘table’ uses ‘factor’ directly, but also indirectly,
in ‘addNA’. The definition of ‘addNA’ ends with:
    if (!any(is.na(ll)))
        ll <- c(ll, NA)
    factor(x, levels = ll, exclude = NULL)
This last call is slow for non-ASCII levels. One *could* fix this by changing
the last line to
    attr(x, "levels") <- ll
But one soon ends up changing every function that uses ‘factor’ in this way,
which seems like the wrong approach. The problem lies inside ‘factor’,
and that’s where it should be fixed, if feasible.
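To make the difference between the two variants concrete, here is a minimal sketch. It is about what each call produces, not about timing, and the variable names are only for illustration. ASCII levels are used so that it runs the same everywhere. Note that setting the attribute directly only replaces the levels; existing NA values keep an NA code and are not mapped to the new NA level, so the shortcut would need a little more care to be a drop-in replacement.

```r
x  <- factor(c("a", "b", NA))   # levels "a", "b"; the third value is NA
ll <- c(levels(x), NA)          # what addNA builds before its last line

# Variant 1: the current last line of addNA. This goes through
# as.character() and match(), which is where the time is spent.
a <- factor(x, levels = ll, exclude = NULL)

# Variant 2: only replace the levels attribute, leaving the codes alone.
b <- x
attr(b, "levels") <- ll

identical(levels(a), levels(b))  # same levels in both cases
as.integer(a)                    # NA value now points at the NA level
as.integer(b)                    # NA value still has an NA code
```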
BTW, the definition of ‘addNA’ looks suboptimal in a different way. The last
line is always executed, even if the factor *does* contain NA values (and of
course NA levels). In that case it basically does nothing, and just takes
a very long time doing it (at least on Windows). Moving the last line inside
the ‘if’ clause, and adding an ‘else return(x)’, would fix this (correct me if
I’m wrong).
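A rough sketch of what that change might look like follows. The name ‘addNA_sketch’ is made up for this example, and the lines before the ‘if’ are written from memory of the current ‘addNA’ source, so treat them as an assumption. It should give the same result as ‘addNA’ for ordinary factors; odd cases, such as a factor that has an NA level and also NA codes, are not covered.

```r
# Hypothetical variant of addNA(); not part of R.
addNA_sketch <- function(x, ifany = FALSE) {
    if (!is.factor(x))
        x <- factor(x)
    if (ifany && !any(is.na(x)))
        return(x)
    ll <- levels(x)
    if (!any(is.na(ll))) {
        # NA is not yet a level: add it and rebuild the factor.
        ll <- c(ll, NA)
        factor(x, levels = ll, exclude = NULL)
    } else {
        # NA is already a level: skip the expensive call to factor().
        return(x)
    }
}

f1 <- factor(c("a", "b", NA))                  # no NA level yet
f2 <- factor(c("a", "b", NA), exclude = NULL)  # NA already a level
identical(addNA_sketch(f1), addNA(f1))
identical(addNA_sketch(f2), addNA(f2))
```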
--
Karl Ove Hufthammer