[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows

Karl Ove Hufthammer karl at huftis.org
Wed Jan 26 14:37:08 CET 2011


Karl Ove Hufthammer wrote:

> Anyway, do you think it’s worth trying to change the ‘table’ function the
> way I outlined in my first post¹? This should eliminate the performance
> hit on all platforms.

Some additional notes: ‘table’ uses ‘factor’ directly, but also indirectly, 
in ‘addNA’. The definition of ‘addNA’ ends with:

    if (!any(is.na(ll))) 
        ll <- c(ll, NA)
    factor(x, levels = ll, exclude = NULL)

Which is slow for non-ASCII levels. One *could* fix this by replacing the 
last line with

    attr(x, "levels") <- ll
    x

But one soon ends up changing every function that uses ‘factor’ in this way, 
which seems like the wrong approach. The problem lies inside ‘factor’,
and that’s where it should be fixed, if feasible.
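
To make the comparison concrete, here is a minimal sketch of the two 
approaches on a small factor. The sample data is made up, and nothing is 
timed, since any timing depends on platform, locale and R version:

```r
# Small factor with non-ASCII levels and no NA values (made-up sample data).
x <- factor(c("blåbær", "jordbær", "blåbær", "tyttebær"))
ll <- c(levels(x), NA)

# What 'addNA' does now: rebuild the factor, re-matching every value
# against the levels.
y1 <- factor(x, levels = ll, exclude = NULL)

# The shortcut: keep the integer codes and only replace the levels.
y2 <- x
attr(y2, "levels") <- ll

# Both give the same levels and the same codes for this input.
identical(levels(y1), levels(y2))
identical(as.integer(y1), as.integer(y2))
```

Note that the two only agree because this x has no NA values. With NA 
values in x, ‘factor’ recodes them to the new NA level, while the attribute 
assignment leaves them as NA codes.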

BTW, the definition of ‘addNA’ looks suboptimal in a different way. The last 
line is always executed, even if the factor *does* contain NA values (and of 
course NA levels). For this case, basically it’s doing nothing, just taking 
a very long time doing it (at least on Windows). Moving the last line inside 
the ‘if’ clause, and adding an ‘else return(x)’ would fix this (correct me if 
I’m wrong).
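
A rough sketch of that restructuring follows. ‘addNA_sketch’ is a 
hypothetical name, and everything around the three quoted lines (the 
conversion to factor, the missing ‘ifany’ argument) is a simplifying 
assumption, not the actual base R source:

```r
# Hypothetical variant of addNA(); not the base R definition.
addNA_sketch <- function(x) {
    if (!is.factor(x)) x <- factor(x)
    ll <- levels(x)
    if (!any(is.na(ll))) {
        # No NA level yet: add one and rebuild the factor.
        ll <- c(ll, NA)
        factor(x, levels = ll, exclude = NULL)
    } else {
        # NA is already a level: skip the expensive call to factor().
        # Assumes such a factor has no leftover NA codes to recode.
        x
    }
}

x <- factor(c("a", NA, "b"))
y <- addNA_sketch(x)   # NA becomes a third level
z <- addNA_sketch(y)   # second call returns y unchanged
```

The early return is only safe under the assumption stated in the comment. A 
factor that has an NA level *and* NA codes would come back unchanged, which 
may or may not be what the current definition does for that input.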

-- 
Karl Ove Hufthammer


