[Rd] table() and as.character() performance for logical values
Aidan Lakshman
AHL27 @end|ng |rom p|tt@edu
Fri Mar 21 15:42:21 CET 2025
Some small points to add on this discussion:
> After investigating the source of table, I ended up on the reason being “as.character()”:
Specifically, the `as.character` call happens when `table` converts its input to a factor, and that conversion is where most of the time goes:
# Timing is all on my local machine (OSX)
N_v <- sample(c(1,0), 10^7, replace = TRUE)
L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
# user system elapsed
system.time(table(N_v)) # 2.155 0.039 2.192
system.time(table(L_v)) # 0.806 0.030 0.838
system.time(N_fv <- as.factor(N_v)) # 2.026 0.024 2.050
system.time(L_fv <- as.factor(L_v)) # 0.668 0.015 0.683
system.time(table(N_fv)) # 0.133 0.022 0.156
system.time(table(L_fv)) # 0.134 0.018 0.151
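To see where the time goes, the factor conversion can be approximated by hand. This is only a rough sketch: it assumes that `factor()` on a non-character vector essentially does a `unique`, a full-length `as.character`, and a `match` against the levels. The names `x`, `lev`, `chr`, and `codes` are illustrative, and the vector is smaller than above to keep it quick.

```r
# Rough decomposition of the factor conversion (a sketch, not the base R source)
set.seed(1)
x <- sample(c(TRUE, FALSE), 10^6, replace = TRUE)

# Levels: cheap, since there are only two distinct values
lev <- as.character(sort(unique(x)))

# Full-length conversion to strings, then integer codes per element
t_chr   <- system.time(chr <- as.character(x))
t_match <- system.time(codes <- match(chr, lev))

# The hand-built codes and levels agree with what as.factor() produces
f <- as.factor(x)
stopifnot(identical(codes, as.integer(f)))
stopifnot(identical(lev, levels(f)))

print(t_chr)    # time spent in as.character()
print(t_match)  # time spent in match()
```

Comparing the two timings on your own machine should show how much of the conversion is attributable to `as.character` alone.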
> The performance for Integers and especially booleans is quite surprising.
Of note is that performance is significantly better with `tabulate`, since it doesn't involve a conversion to factor (though the input must be numeric or a factor, results aren't named, NA handling is worse, and only positive integers are counted, so the 0/FALSE values in the calls below are silently ignored). If you have performance-critical calls like this, you could consider using `tabulate` instead.
system.time(tabulate(N_v)) # 0.054 0.002 0.056
system.time(tabulate(as.integer(L_v))) # 0.052 0.002 0.055
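Because `tabulate` only counts the bins 1, 2, ..., a logical or 0/1 vector needs to be shifted by one to get both counts. A minimal sketch on a small vector (the names `x` and `counts` are just for illustration):

```r
# tabulate() counts only positive integers, so 0 / FALSE fall outside the bins
x <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

tabulate(as.integer(x))   # counts the TRUEs only: 3

# Shift by one so FALSE -> bin 1 and TRUE -> bin 2, then name the result by hand
counts <- tabulate(as.integer(x) + 1L, nbins = 2L)
names(counts) <- c("FALSE", "TRUE")
counts                    # FALSE = 2, TRUE = 3
```

Fixing `nbins = 2L` keeps the result length stable even if one of the two values never occurs.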
I don't know if this is a known issue or not; most of my colleagues are aware of the slow-down and use `tabulate` when performance is required. My understanding was that the slower performance is a trade-off for more consistent behavior (better output, better handling of ambiguities/NA, etc.), and that speed isn't the highest priority with `table`. Maybe someone else has a better understanding of the history of the function.
As for improving the speed, it would basically come down to refactoring `table` to not use a `factor` conversion. I'd be concerned about introducing a lot of edge cases with that, but it's theoretically possible. Based on 30 seconds of thinking, it may be possible to do something like:
## just a sketch of a barebones non-factor implementation
test_tab <- function(x){
  lookup <- unique(x)
  counts <- tabulate(match(x, lookup))
  names(counts) <- as.character(lookup)
  counts
}
system.time(test_tab(L_v)) # 0.101 0.006 0.107
system.time(test_tab(N_v)) # 0.129 0.015 0.144
This is also faster in the case where there are lots of categories with few entries per category:
N_v2 <- 1:1e7
system.time(test_tab(N_v2)) # 0.383 0.024 0.411
system.time(table(N_v2)) # 6.122 0.228 6.398
Obviously there are some big shortcomings:
- it's missing a lot of error checking etc. that the standard `table` has
- it only works with 1D vectors
- NA handling isn't quite the same as `table` (though it would be easy to adapt)
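On the NA point, one possible adaptation is sketched below. It assumes the goal is to mimic `table`'s default of dropping NA and ordering the categories; `test_tab_sorted` is a hypothetical name, and the function is still 1D-only with no error checking.

```r
# Hypothetical variant of the sketch: sorted categories, NAs dropped
test_tab_sorted <- function(x) {
  lookup <- sort(unique(x))             # sort() drops NA by default
  # NA no longer matches anything in lookup, and tabulate() ignores NA bins
  counts <- tabulate(match(x, lookup), nbins = length(lookup))
  names(counts) <- as.character(lookup)
  counts
}

x <- c(TRUE, NA, FALSE, TRUE, NA, TRUE)
res <- test_tab_sorted(x)
res

# Counts and names agree with table() for this input
stopifnot(identical(unname(res), as.vector(table(x))))
stopifnot(identical(names(res), names(table(x))))
```

The `sort` adds some cost when there are many distinct values, so the timings above would not carry over unchanged.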
Just including this to potentially start a discussion on optimization.
For reference, the relevant section is in src/library/base/R/table.R:L75-85
-Aidan
-----------------------
Aidan Lakshman (he/him)
http://www.ahl27.com/
On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:
> I was calling table() on some long logical vectors and noticed that it took a long time.
>
> Out of curiosity I checked the performance of table() on different types, and had some unexpected results:
>
> C <- sample(c("yes", "no"), 10^7, replace = TRUE)
> F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
> N <- sample(c(1,0), 10^7, replace = TRUE)
> I <- sample(c(1L,0L), 10^7, replace = TRUE)
> L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
>
> # ordered by execution time
> # user system elapsed
> system.time(table(F)) # 0.088 0.006 0.093
> system.time(table(C)) # 0.208 0.017 0.224
> system.time(table(I)) # 0.242 0.019 0.261
> system.time(table(L)) # 0.665 0.015 0.680
> system.time(table(N)) # 1.771 0.019 1.791
>
>
> The performance for Integers and especially booleans is quite surprising.
> After investigating the source of table, I ended up on the reason being “as.character()”:
>
> system.time(as.character(L))
> user system elapsed
> 0.461 0.002 0.462
>
> Even a manual conversion can achieve a speed-up by a factor of ~7:
>
> system.time(c("FALSE", "TRUE")[L+1])
> user system elapsed
> 0.061 0.006 0.067
>
>
> Tested on 4.4.3 as well as devel trunk.
>
> Just reporting for comments and attention.
> Karolis K.
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel