[Rd] table() and as.character() performance for logical values
Sebastian Meyer
@eb@meyer @end|ng |rom |@u@de
Wed Apr 9 23:51:42 CEST 2025
Right, thanks! These are non-standard uses of factor(), edge cases I
alluded to. We didn't see any problems with the patch in existing tests
nor in CRAN/BIOC package checks. Note that 'levels' is documented as

    an optional vector of the unique values (as character
    strings) that ‘x’ might have taken.

E.g.: factor(1L, levels = "1") is identical to factor(1, levels = "1").
Using integers (or logicals) for *both* 'x' and 'levels' still works
(and is used by, e.g., cut.default() internally, is more efficient now,
and could be documented), but that 2L fails to match sqrt(2)^2 doesn't
really come as a surprise.
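
A minimal sketch of the two matching modes discussed here. It uses plain match() rather than factor() itself, so it should not depend on whether the patch is installed; the variable names are illustrative only:

```r
# 'lev' is slightly above 2 in floating point, but prints as "2"
# when converted to character (15 significant digits).
lev <- sqrt(2)^2

# Character matching: what factor() effectively did for non-character 'x'
# before the patch. match() coerces 'lev' to character as well.
chr_match <- match(as.character(2L), lev)

# Direct matching: what happens once as.character() is skipped for
# integer 'x'. Both sides are compared as doubles.
direct_match <- match(2L, lev)

# With character 'levels', integer and double 'x' give the same factor.
same_factor <- identical(factor(1L, levels = "1"), factor(1, levels = "1"))
```

Here chr_match finds the level while direct_match is NA, which is the difference reported for factor(2L, levels = sqrt(2)^2).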
I'm not sure if it is worth special-casing integer/logical 'x' with
specified non-character 'levels' of a non-conforming type, but yes,
*not* skipping as.character() would then give the more consistent
undocumented behaviour from before the performance patch. Maybe
something to consider for 4.5.1.
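
As a rough illustration of what such special-casing might look like, here is a sketch of a guard that allows direct matching only when 'levels' is absent, is character, or has the same plain type as 'x'. The helper name can_match_directly() is hypothetical and not part of base R or of the patch:

```r
# Hypothetical helper (an assumption for illustration, not base R code):
# decide whether as.character(x) could be skipped without changing
# how 'x' is matched against 'levels'.
can_match_directly <- function(x, levels = NULL) {
    plain_x <- !is.object(x) &&
        (is.character(x) || is.integer(x) || is.logical(x))
    if (!plain_x) return(FALSE)
    # No user-supplied levels: they are derived from 'x' itself.
    if (is.null(levels)) return(TRUE)
    # Character levels, or unclassed levels of the same type as 'x',
    # are treated as conforming; anything else falls back to
    # as.character().
    !is.object(levels) &&
        (is.character(levels) || typeof(levels) == typeof(x))
}

can_match_directly(1L, levels = TRUE)        # non-conforming type
can_match_directly(2L, levels = sqrt(2)^2)   # non-conforming type
can_match_directly(TRUE, levels = c(FALSE, TRUE))
```

Whether the extra type check is worth its cost is exactly the open question above.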
Sebastian Meyer
On 09.04.25 at 08:26, Suharto Anggono Suharto Anggono via R-devel wrote:
> With the change to 'factor',
> factor(1L, levels = TRUE)
> doesn't give NA, different from
> factor(1, levels = TRUE)
>
> With the change to 'factor',
> factor(TRUE, levels = 1L)
> and
> factor(TRUE, levels = 1)
> don't give NA.
>
> With the change to 'factor',
> factor(2L, levels = sqrt(2)^2)
> gives NA, different from
> factor(2, levels = sqrt(2)^2)
>
> With the change to 'factor',
> factor(2L, exclude = sqrt(2)^2)
> has 1 level (nothing is excluded), different from
> factor(2, exclude = sqrt(2)^2)
>
> ------------
> On 21.03.25 at 15:42, Aidan Lakshman via R-devel wrote:
>>> After investigating the source of table, I ended up on the reason being “as.character()”:
>>
>> This is specifically happening within the conversion of the input to type factor, which is where the as.character conversion happens.
>
> Yes, I also think 'factor' could do a bit better for unclassed integers
> (such as when called from 'cut') as well as for logical input (such as
> from 'summary' -> 'table').
>
> Note that 'as.factor' already has a "fast track" for plain integers
> (originally for 'split.default' from 'tapply'), so can be used instead
> of 'factor' when there is no need for custom 'levels', 'labels', or
> 'exclude'. (Thanks for already mentioning 'tabulate'.)
>
> A 'factor' patch would apply more broadly, e.g.:
>
> ===================================================================
> --- src/library/base/R/factor.R (revision 88042)
> +++ src/library/base/R/factor.R (working copy)
> @@ -20,14 +20,18 @@
>                     exclude = NA, ordered = is.ordered(x), nmax = NA)
>  {
>      if(is.null(x)) x <- character()
> +    directmatch <- !is.object(x) &&
> +        (is.character(x) || is.integer(x) || is.logical(x))
>      nx <- names(x)
>      if (missing(levels)) {
>          y <- unique(x, nmax = nmax)
>          ind <- order(y)
> -        levels <- unique(as.character(y)[ind])
> +        if (!directmatch)
> +            y <- as.character(y)
> +        levels <- unique(y[ind])
>      }
>      force(ordered) # check if original x is an ordered factor
> -    if(!is.character(x))
> +    if(!directmatch)
>          x <- as.character(x)
>      ## levels could be a long vector, but match will not handle that.
>      levels <- levels[is.na(match(levels, exclude))]
>      f <- match(x, levels)
> ===================================================================
>
> This skips as.character() also for integer/logical 'x' and would indeed
> bring table() runtimes "in order":
>
> set.seed(1)
> C <- sample(c("no", "yes"), 10^7, replace = TRUE)
> F <- as.factor(C)
> L <- F == "yes"
> I <- as.integer(L)
> N <- as.numeric(I)
>
> ## Median system.time(table(.)) in ms:
> ## table(F) 256
> ## table(I) 384 # not 696
> ## table(L) 409 # not 1159
> ## table(C) 591
> ## table(N) 3324
>
> The (seemingly) small patch passes check-all, but maybe it overlooks
> some edge cases. I'd test it on a subset of CRAN/BIOC packages.
>
> Best,
>
> Sebastian Meyer
>
>>
>> # Timing is all on my local machine (OSX)
>> N_v <- sample(c(1,0), 10^7, replace = TRUE)
>> L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
>> # user system elapsed
>> system.time(table(N_v)) # 2.155 0.039 2.192
>> system.time(table(L_v)) # 0.806 0.030 0.838
>>
>> system.time(N_fv <- as.factor(N_v)) # 2.026 0.024 2.050
>> system.time(L_fv <- as.factor(L_v)) # 0.668 0.015 0.683
>>
>> system.time(table(N_fv)) # 0.133 0.022 0.156
>> system.time(table(L_fv)) # 0.134 0.018 0.151
>>
>>> The performance for integers and especially booleans is quite surprising.
>>
>> Of note is that the performance is significantly better if using `tabulate`, since this doesn't involve a conversion to factor (though input must be numeric/factor, results aren't named, and it has worse handling of NA values). If you have performance critical calls like this you could consider using `tabulate` instead.
>>
>> system.time(tabulate(N_v)) # 0.054 0.002 0.056
>> system.time(tabulate(as.integer(L_v))) # 0.052 0.002 0.055
>>
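
One caveat with the tabulate(as.integer(L_v)) call above, shown on a small vector: tabulate() only counts positive integers, so FALSE (coerced to 0) is silently dropped. A sketch, where the `+ 1L` shift, the explicit nbins and the names are illustrative assumptions rather than part of the original suggestion:

```r
l <- c(TRUE, FALSE, TRUE)

# as.integer(FALSE) is 0, which falls outside the bins 1..nbins,
# so only the TRUE values are counted here.
only_true <- tabulate(as.integer(l))

# Shifting by one maps FALSE to bin 1 and TRUE to bin 2.
both <- tabulate(as.integer(l) + 1L, nbins = 2L)
names(both) <- c("FALSE", "TRUE")

# Counts agree with table(), which also reports FALSE before TRUE.
agrees <- identical(as.vector(table(l)), unname(both))
```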
>>
>> I don't know if this is a known issue or not; most of my colleagues are aware of the slow-down and use `tabulate` when performance is required. My understanding was that the slower performance is a trade-off for more consistent performance (better output, better handling of ambiguities/NA, etc.), and that speed isn't the highest priority with `table`. Maybe someone else has a better understanding of the history of the function.
>>
>> As for improving the speed, it would basically come down to refactoring `table` to not use a `factor` conversion. I'd be concerned about introducing a lot of edge cases with that, but it's theoretically possible. Based on 30 seconds of thinking, it may be possible to do something like:
>>
>> ## just a sketch of a barebones non-factor implementation
>> test_tab <- function(x){
>> lookup <- unique(x)
>> counts <- tabulate(match(x, lookup))
>> names(counts) <- as.character(lookup)
>> counts
>> }
>>
>> system.time(test_tab(L_v)) # 0.101 0.006 0.107
>> system.time(test_tab(N_v)) # 0.129 0.015 0.144
>>
>> This is also faster in the case where there are lots of categories with few entries per category:
>>
>> N_v2 <- 1:1e7
>> system.time(test_tab(N_v2)) # 0.383 0.024 0.411
>> system.time(table(N_v2)) # 6.122 0.228 6.398
>>
>> Obviously there are some big shortcomings:
>> - it's missing a lot of error checking etc. that the standard `table` has
>> - it only works with 1D vectors
>> - NA handling isn't quite the same as `table` (though it would be easy to adapt)
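
A possible adaptation of the sketch for the NA point, assuming the goal is to mimic the default table() behaviour (NA dropped, categories sorted) for a single atomic vector. The name test_tab_sorted() is hypothetical:

```r
## Hypothetical variant of test_tab(); still 1D only and without the
## argument checking that table() performs.
test_tab_sorted <- function(x) {
    x <- x[!is.na(x)]            # table() drops NA by default
    lookup <- sort(unique(x))    # table() reports categories in sorted order
    counts <- tabulate(match(x, lookup), nbins = length(lookup))
    names(counts) <- as.character(lookup)
    counts
}

x <- c(TRUE, NA, FALSE, TRUE)
res <- test_tab_sorted(x)
ref <- table(x)
```

For doubles, distinct values that share the same character representation would still end up as separate, identically named entries here, unlike in table().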
>>
>> Just including to potentially start discussion for optimization.
>>
>> For reference, the relevant section is in src/library/base/R/table.R:L75-85
>>
>> -Aidan
>>
>> -----------------------
>> Aidan Lakshman (he/him)
>> http://www.ahl27.com/
>>
>> On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:
>>
>>> I was calling table() on some long logical vectors and noticed that it took a long time.
>>>
>>> Out of curiosity I checked the performance of table() on different types, and had some unexpected results:
>>>
>>> C <- sample(c("yes", "no"), 10^7, replace = TRUE)
>>> F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
>>> N <- sample(c(1,0), 10^7, replace = TRUE)
>>> I <- sample(c(1L,0L), 10^7, replace = TRUE)
>>> L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
>>>
>>> # ordered by execution time
>>> # user system elapsed
>>> system.time(table(F)) # 0.088 0.006 0.093
>>> system.time(table(C)) # 0.208 0.017 0.224
>>> system.time(table(I)) # 0.242 0.019 0.261
>>> system.time(table(L)) # 0.665 0.015 0.680
>>> system.time(table(N)) # 1.771 0.019 1.791
>>>
>>>
>>> The performance for integers and especially booleans is quite surprising.
>>> After investigating the source of table, I ended up on the reason being “as.character()”:
>>>
>>> system.time(as.character(L))
>>> user system elapsed
>>> 0.461 0.002 0.462
>>>
>>> Even a manual conversion can achieve a speed-up by a factor of ~7:
>>>
>>> system.time(c("FALSE", "TRUE")[L+1])
>>> user system elapsed
>>> 0.061 0.006 0.067
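
The manual conversion above works because arithmetic on a logical vector coerces FALSE/TRUE to 0/1, so L + 1 becomes an index into the two labels. A small check that it agrees with as.character(), including for NA (the vector `l` is an illustrative stand-in for L):

```r
l <- c(TRUE, FALSE, NA)

# FALSE -> index 1, TRUE -> index 2, NA -> NA index -> NA_character_
manual <- c("FALSE", "TRUE")[l + 1L]

same <- identical(manual, as.character(l))
```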
>>>
>>>
>>> Tested on 4.4.3 as well as devel trunk.
>>>
>>> Just reporting for comments and attention.
>>> Karolis K.
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel