[Rd] Regular expressions & large strings (PR#6617)
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sat Feb 28 12:31:13 MET 2004
I was able to confirm the error on RH8.0 Linux and the segfault on
Windows.
Note that PCRE is not being used by these calls: if you add perl=TRUE to
your [g]sub calls you get correct results extremely fast.
The segfault is occurring in regexec, that is in the GNU regex code
included in R. I am not convinced it is worth spending any time trying to
find the problem in that code, as
- you can use perl=TRUE as an alternative
- we will be replacing the GNU regex code in due course to cope with
  internationalization issues.
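A minimal sketch of the perl=TRUE workaround, reusing the test string and
patterns from the report quoted below. It assumes an R build with PCRE
support; the helper name time_gsub is made up for illustration and is not
part of the original report.

```r
# Long test string from the report: a short header followed by ~2.4 million spaces.
t5 <- paste(c("# === TEST", rep(" ", 2452294)), collapse = "")

# perl = TRUE routes the match through PCRE rather than the default regex code.
long_res  <- sub("^.*TEST", "xyz", t5, perl = TRUE)
short_res <- sub("^.*TEST", "xyz", substr(t5, 1, 200), perl = TRUE)

str(long_res)
str(short_res)

# Hypothetical helper: time the gsub() call from the report with perl = TRUE.
time_gsub <- function(n) {
  line <- paste(as.character(trunc(runif(n) * 100)), collapse = " ")
  system.time(res <- gsub("[[:space:]]", "-", line, perl = TRUE))
}

print(time_gsub(2e4))
```

Both str() calls should now show a string beginning with "xyz"; timings from
time_gsub will of course depend on the machine and R version.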
On Fri, 27 Feb 2004 mjw at celos.net wrote:
> A possible regex bug when working with large strings. The
> following code snippet
>
> t5 <- paste( c( "# === TEST", rep(' ', 2452294) ), collapse='')
> str( sub("^.*TEST", "xyz", t5) )
> str( sub("^.*TEST", "xyz", substr(t5,0,200)) )
>
> doesn't behave right; on one machine, the second and third
> lines print different results [the second line, on the long
> string, doesn't do the substitution], while on another, the
> second line causes a segfault. Both are running R 1.8.1
> with PCRE, under NetBSD (1.6.1 and 1.6 respectively).
>
> Possibly related (although perhaps not a bug):
>
> function(n) {
>   line <- paste(as.character(trunc(runif(n)*100)),collapse=" ")
>   system.time( rep <- gsub("[[:space:]]", "-", line) )
> }
>
> gives rather long times, rising very sharply for big strings (eg
> 2.2s at n=2e4, 360s at n=2e5 on AMD 1.2GHz). Other languages
> aren't so slow on this task (eg n=2e5: 0.4s ruby 1.8.1, and
> 5.2s python 2). Doubtless my extremely-quick-hack benchmarks
> aren't fair, but the difference still seems rather big.
>
> Mark <><
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595