[Rd] strcapture performance when perl = TRUE

Mon Jan 29 15:04:16 CET 2024

I wanted to raise the possibility of improving strcapture performance in
cases where perl = TRUE. I believe we can do this in a non-breaking way
by calling regexpr instead of regexec (conditionally when perl = TRUE).
To illustrate this I've put together a 'proof of concept' function called
strcapture2 that utilises output from regexpr directly (following a very
nice substring approach that I've seen implemented by Toby Hocking
in the nc package - nc::capture_first_vec).

strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes = FALSE) {
    if (isTRUE(perl)) {
        m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes = useBytes)
        nomatch <- is.na(m) | m == -1L
        ntokens <- length(proto)
        if (any(!nomatch)) {
            length <- attr(m, "match.length")
            start <- attr(m, "capture.start")
            length <- attr(m, "capture.length")
            end <- start + length - 1L
            end[nomatch, ] <- start[nomatch, ] <- NA
            res <- substring(x, start, end)
            out <- matrix(res, length(m))
            if (ncol(out) != ntokens) {
                stop("The number of captures in 'pattern' != 'length(proto)'")
            }
        } else {
            out <- matrix(NA_character_, length(m), ntokens)
        }
        utils:::conformToProto(out,proto)
    } else {
        strcapture(pattern,x,proto,perl,useBytes)
    }
}

Now comparing with strcapture we can expand the named capture example
from the grep documentation:

notables <- c(
    "  Ben Franklin and Jefferson Davis",
    "\tMillard Fillmore",
    "Bob",
    NA_character_
)

regex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
proto = data.frame("", "")

(strcapture(regex, notables, proto, perl = TRUE))
      X..    X...1
1     Ben Franklin
2 Millard Fillmore
3    <NA>     <NA>
4    <NA>     <NA>

(strcapture2(regex, notables, proto, perl = TRUE))
      X..    X...1
1     Ben Franklin
2 Millard Fillmore
3    <NA>     <NA>
4    <NA>     <NA>

Now to compare timings over multiple reps:

lengths <- sort(outer(c(1, 2, 5), 10^(1:4)))
reps <- 20 

time_strcapture <- function(text, length, regex, proto, reps) {
    text <- rep_len(text, length)
    str <- system.time(for (i in seq_len(reps)) strcapture(regex, text, proto, perl = TRUE))
    str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex, text, proto, perl = TRUE))
    c(strcapture = str[["user.self"]], strcapture2 = str2[["user.self"]])
}
timings <- sapply(
    lengths,
    time_strcapture,
    text = notables, regex = regex, reps = reps, proto = proto
)
cbind(lengths, t(timings))
      lengths strcapture strcapture2
 [1,]      10      0.005       0.003
 [2,]      20      0.005       0.002
 [3,]      50      0.008       0.003
 [4,]     100      0.012       0.002
 [5,]     200      0.021       0.003
 [6,]     500      0.051       0.003
 [7,]    1000      0.097       0.004
 [8,]    2000      0.171       0.005
 [9,]    5000      0.517       0.011
[10,]   10000      1.203       0.018
[11,]   20000      2.563       0.037
[12,]   50000      7.276       0.090

I've attached a plot of these timings in case helpful.

I appreciate that changing strcapture in this way does make it more
complicated but I think the performance improvements make it worth
considering. Note that I've not thoroughly tested the above implementation
as wanted to get feedback from the list before proceeding further.

Hope all this make sense. Cheers

Tim

-------------- next part --------------
A non-text attachment was scrubbed...
Name: strcapture.png
Type: image/png
Size: 28123 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20240129/261d61e7/attachment.png>