[Rd] strcapture performance when perl = TRUE

Wed Feb 14 04:51:37 CET 2024

Thanks Tim.
I confirm the proposed solution is over 10x faster, see
https://github.com/tdhock/atime/issues/29#issuecomment-1943037753 for
figure and source code.

On Mon, Jan 29, 2024 at 7:05 AM Tim Taylor
<tim.taylor using hiddenelephants.co.uk> wrote:
>
> I wanted to raise the possibility of improving strcapture performance in
> cases where perl = TRUE. I believe we can do this in a non-breaking way
> by calling regexpr instead of regexec (conditionally when perl = TRUE).
> To illustrate this I've put together a 'proof of concept' function called
> strcapture2 that utilises output from regexpr directly (following a very
> nice substring approach that I've seen implemented by Toby Hocking
> in the nc package - nc::capture_first_vec).
>
> strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes = FALSE) {
>     if (isTRUE(perl)) {
>         m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes = useBytes)
>         nomatch <- is.na(m) | m == -1L
>         ntokens <- length(proto)
>         if (any(!nomatch)) {
>             length <- attr(m, "match.length")
>             start <- attr(m, "capture.start")
>             length <- attr(m, "capture.length")
>             end <- start + length - 1L
>             end[nomatch, ] <- start[nomatch, ] <- NA
>             res <- substring(x, start, end)
>             out <- matrix(res, length(m))
>             if (ncol(out) != ntokens) {
>                 stop("The number of captures in 'pattern' != 'length(proto)'")
>             }
>         } else {
>             out <- matrix(NA_character_, length(m), ntokens)
>         }
>         utils:::conformToProto(out,proto)
>     } else {
>         strcapture(pattern,x,proto,perl,useBytes)
>     }
> }
>
> Now comparing with strcapture we can expand the named capture example
> from the grep documentation:
>
> notables <- c(
>     "  Ben Franklin and Jefferson Davis",
>     "\tMillard Fillmore",
>     "Bob",
>     NA_character_
> )
>
> regex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
> proto = data.frame("", "")
>
> (strcapture(regex, notables, proto, perl = TRUE))
>       X..    X...1
> 1     Ben Franklin
> 2 Millard Fillmore
> 3    <NA>     <NA>
> 4    <NA>     <NA>
>
> (strcapture2(regex, notables, proto, perl = TRUE))
>       X..    X...1
> 1     Ben Franklin
> 2 Millard Fillmore
> 3    <NA>     <NA>
> 4    <NA>     <NA>
>
> Now to compare timings over multiple reps:
>
> lengths <- sort(outer(c(1, 2, 5), 10^(1:4)))
> reps <- 20
>
> time_strcapture <- function(text, length, regex, proto, reps) {
>     text <- rep_len(text, length)
>     str <- system.time(for (i in seq_len(reps)) strcapture(regex, text, proto, perl = TRUE))
>     str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex, text, proto, perl = TRUE))
>     c(strcapture = str[["user.self"]], strcapture2 = str2[["user.self"]])
> }
> timings <- sapply(
>     lengths,
>     time_strcapture,
>     text = notables, regex = regex, reps = reps, proto = proto
> )
> cbind(lengths, t(timings))
>       lengths strcapture strcapture2
>  [1,]      10      0.005       0.003
>  [2,]      20      0.005       0.002
>  [3,]      50      0.008       0.003
>  [4,]     100      0.012       0.002
>  [5,]     200      0.021       0.003
>  [6,]     500      0.051       0.003
>  [7,]    1000      0.097       0.004
>  [8,]    2000      0.171       0.005
>  [9,]    5000      0.517       0.011
> [10,]   10000      1.203       0.018
> [11,]   20000      2.563       0.037
> [12,]   50000      7.276       0.090
>
> I've attached a plot of these timings in case helpful.
>
> I appreciate that changing strcapture in this way does make it more
> complicated but I think the performance improvements make it worth
> considering. Note that I've not thoroughly tested the above implementation
> as wanted to get feedback from the list before proceeding further.
>
> Hope all this make sense. Cheers
>
> Tim
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel