[Rd] table() and as.character() performance for logical values

Fri Mar 21 15:42:21 CET 2025

Some small points to add on this discussion:

> After investigating the source of table, I ended up on the reason being “as.character()”:

This happens specifically when `table` converts its input to a factor, which is where the `as.character` call takes place.

  # Timing is all on my local machine (OSX)
  N_v <- sample(c(1,0), 10^7, replace = TRUE)
  L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
                                         #  user  system elapsed
  system.time(table(N_v))                # 2.155   0.039   2.192
  system.time(table(L_v))                # 0.806   0.030   0.838

  system.time(N_fv <- as.factor(N_v))    # 2.026   0.024   2.050
  system.time(L_fv <- as.factor(L_v))    # 0.668   0.015   0.683

  system.time(table(N_fv))               # 0.133   0.022   0.156
  system.time(table(L_fv))               # 0.134   0.018   0.151
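
To make the cost breakdown a bit more concrete, here is a rough sketch that rebuilds a factor from a logical vector by hand and times the two main steps separately. The split into steps is only an illustration of what a character-based conversion roughly does, not a copy of the `factor` source, and the vector is smaller than above so that it runs quickly:

```r
# Hand-rolled factor conversion, split into steps so each can be timed.
# Assumption: this mirrors the character-based route only approximately.
set.seed(1)
x <- sample(c(TRUE, FALSE), 1e6, replace = TRUE)

lev <- as.character(sort(unique(x)))              # levels: "FALSE" "TRUE"
t_chr   <- system.time(chr <- as.character(x))    # convert every element to a string
t_match <- system.time(codes <- match(chr, lev))  # map strings to integer codes
f <- structure(codes, levels = lev, class = "factor")

print(t_chr)    # time spent in as.character()
print(t_match)  # time spent in match()
```

Timings will differ between machines and R versions, so none are quoted here; the point is only to see which of the two steps dominates.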

> The performance for Integers and specially booleans is quite surprising.

Of note is that the performance is significantly better if using `tabulate`, since this doesn't involve a conversion to factor (though input must be numeric/factor, only positive integer values are counted, results aren't named, and it has worse handling of NA values). If you have performance-critical calls like this you could consider using `tabulate` instead. Note that because zeros are silently ignored, the two calls below only count the 1/TRUE entries.

  system.time(tabulate(N_v))             # 0.054   0.002   0.056
  system.time(tabulate(as.integer(L_v))) # 0.052   0.002   0.055
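
Since `tabulate` only counts values in 1..nbins, a logical vector needs to be shifted by one before both FALSE and TRUE are counted. A minimal sketch of that, where `tab_lgl` is a hypothetical helper name and not something in base R:

```r
# Count FALSE and TRUE in a logical vector with tabulate().
# tab_lgl is a made-up helper name for illustration only.
tab_lgl <- function(x) {
  stopifnot(is.logical(x))
  # FALSE -> 1, TRUE -> 2; NA stays NA and is ignored by tabulate()
  counts <- tabulate(as.integer(x) + 1L, nbins = 2L)
  names(counts) <- c("FALSE", "TRUE")
  counts
}

tab_lgl(c(TRUE, FALSE, TRUE, NA))
```

Unlike `table`, this always reports both categories, even when one of them has a count of zero, and it drops NA without any option to count it.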

I don't know if this is a known issue or not; most of my colleagues are aware of the slow-down and use `tabulate` when performance is required. My understanding was that the slower performance is a trade-off for more consistent behavior (better output, better handling of ambiguities/NA, etc.), and that speed isn't the highest priority with `table`. Maybe someone else has a better understanding of the history of the function.

As for improving the speed, it would basically come down to refactoring `table` to not use a `factor` conversion. I'd be concerned about introducing a lot of edge cases with that, but it's theoretically possible. Based on 30 seconds of thinking, it may be possible to do something like:

  ## just a sketch of a barebones non-factor implementation
  test_tab <- function(x){
    lookup <- unique(x)
    counts <- tabulate(match(x, lookup))
    names(counts) <- as.character(lookup)
    counts
  }

  system.time(test_tab(L_v))  # 0.101   0.006   0.107
  system.time(test_tab(N_v))  # 0.129   0.015   0.144

This is also faster in the case where there are lots of categories with few entries per category:

  N_v2 <- 1:1e7
  system.time(test_tab(N_v2)) # 0.383   0.024   0.411
  system.time(table(N_v2))    # 6.122   0.228   6.398

Obviously there are some big shortcomings:
- it's missing a lot of error checking etc. that the standard `table` has
- it only works with 1D vectors
- NA handling isn't quite the same as `table` (though it would be easy to adapt)
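
As a rough idea of how the ordering and NA handling could be brought closer to `table`, the sketch can be extended as follows. The name `test_tab2` and its `useNA` argument (a plain TRUE/FALSE here, unlike the `useNA` argument of `table`) are hypothetical and only for illustration:

```r
# Variant of the non-factor sketch with sorted categories and optional NA count.
# test_tab2 and its logical useNA flag are illustrative assumptions.
test_tab2 <- function(x, useNA = FALSE) {
  # na.last = NA drops NA from the categories; na.last = TRUE keeps it at the end
  lookup <- sort(unique(x), na.last = if (useNA) TRUE else NA)
  # values without a match (NA when useNA = FALSE) become NA and are not counted
  counts <- tabulate(match(x, lookup), nbins = length(lookup))
  names(counts) <- as.character(lookup)
  counts
}

x <- c(2, 1, NA, 2)
test_tab2(x)                # counts for 1 and 2 only
test_tab2(x, useNA = TRUE)  # adds a count for NA
```

This still returns a plain named integer vector rather than an object of class "table", and it remains limited to 1D input.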

Just including this to potentially start a discussion about optimization.

For reference, the relevant section is in src/library/base/R/table.R:L75-85

-Aidan

-----------------------
Aidan Lakshman (he/him)
http://www.ahl27.com/

On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:

> I was calling table() on some long logical vectors and noticed that it took a long time.
>
> Out of curiosity I checked the performance of table() on different types, and had some unexpected results:
>
>     C <- sample(c("yes", "no"), 10^7, replace = TRUE)
>     F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
>     N <- sample(c(1,0), 10^7, replace = TRUE)
>     I <- sample(c(1L,0L), 10^7, replace = TRUE)
>     L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
>
>                            # ordered by execution time
>                            #   user  system elapsed
>     system.time(table(F))  #  0.088   0.006   0.093
>     system.time(table(C))  #  0.208   0.017   0.224
>     system.time(table(I))  #  0.242   0.019   0.261
>     system.time(table(L))  #  0.665   0.015   0.680
>     system.time(table(N))  #  1.771   0.019   1.791
>
>
> The performance for Integers and specially booleans is quite surprising.
> After investigating the source of table, I ended up on the reason being “as.character()”:
>
>     system.time(as.character(L))
>      user  system elapsed
>     0.461   0.002   0.462
>
> Even a manual conversion can achieve a speed-up by a factor of ~7:
>
>     system.time(c("FALSE", "TRUE")[L+1])
>      user  system elapsed
>     0.061   0.006   0.067
>
>
> Tested on 4.4.3 as well as devel trunk.
>
> Just reporting for comments and attention.
> Karolis K.
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel