[Rd] strsplit(perl=TRUE), gregexpr(perl=TRUE) very slow for long strings
William Dunlap
wdunlap at tibco.com
Fri Jan 6 06:15:17 CET 2017
While doing some speed testing I noticed that in R-3.2.3 the perl=TRUE
variants of strsplit() and gregexpr() took time proportional to the
square of the number of pattern matches in their input strings. E.g.,
the attached test function times gsub, strsplit, and gregexpr, with
perl TRUE (PCRE) and FALSE (TRE), when the input string contains 'n'
matches to the given pattern. Notice the quadratic (in n) time growth
for the StrSplitPCRE and RegExprPCRE columns.
> regex.perf.test(N=2^(11:20))
N SubTRE SubPCRE StrSplitTRE StrSplitPCRE RegExprTRE RegExprPCRE
elapsed 2048 0.00 0.00 0.00 0.00 0.00 0.00
elapsed 4096 0.00 0.00 0.01 0.00 0.00 0.00
elapsed 8192 0.00 0.00 0.00 0.01 0.00 0.01
elapsed 16384 0.02 0.00 0.00 0.05 0.02 0.08
elapsed 32768 0.00 0.00 0.01 0.16 0.00 0.29
elapsed 65536 0.02 0.01 0.04 0.59 0.01 0.96
elapsed 131072 0.03 0.02 0.08 2.37 0.05 2.43
elapsed 262144 0.06 0.04 0.17 9.58 0.10 9.61
elapsed 524288 0.14 0.05 0.36 39.14 0.21 38.58
elapsed 1048576 0.30 0.08 0.52 155.50 0.40 155.43
I have not looked at R's code, but it is possible that the problem is
caused by PCRE repeatedly scanning (once per match) the entire input
string to make sure it is valid UTF-8. If so, adding
PCRE_NO_UTF8_CHECK to the flags given to pcre_exec would solve the
problem. Perhaps R is already doing that in gsub(perl=TRUE).
Here is the test function:
regex.perf.test <- function(N=c(1e4, 2e4, 4e4, 8e4)) {
makeTestString <- function(n) paste(collapse="", rep("ab", n))
s <- lapply(N, makeTestString)
fns <- list(SubTRE=function(si) gsub("a", "", si, perl=FALSE),
SubPCRE=function(si) gsub("a", "", si, perl=TRUE),
StrSplitTRE=function(si) strsplit(si, "a", perl=FALSE),
StrSplitPCRE=function(si) strsplit(si, "a", perl=TRUE),
RegExprTRE=function(si) gregexpr("a", si, perl=FALSE),
RegExprPCRE=function(si) gregexpr("a", si, perl=TRUE))
times <- lapply(fns, function(fn) sapply(s, function(si)
system.time(fn(si))["elapsed"]))
do.call("cbind", c(list(N=N), times))
}
Bill Dunlap
TIBCO Software
wdunlap tibco.com
More information about the R-devel
mailing list