[Rd] gregexpr - match overlap mishandled (PR#13391)

Greg Snow Greg.Snow at imail.org
Fri Dec 12 22:32:56 CET 2008


Where do you get "should" and "expect" from?  All the regular expression tools that I am familiar with match only non-overlapping occurrences unless you do something extra to specify otherwise.  After each match the search resumes at the first character past the end of that match, which is why your 8-character pattern finds 5 matches in the 40-character string rather than 9.  One of the standard references for regular expressions, if you really want to understand what is going on, is "Mastering Regular Expressions" by Jeffrey Friedl.  You should really read through that book before passing judgment on the correctness of an implementation.

If you want the overlaps, you need to come up with a regular expression that matches without consuming all of the text it checks.  Here is one way to do it with your example, using a Perl-style lookahead so that only the first "1122" of each occurrence is consumed:

 > gregexpr("1122(?=1122)", paste(rep("1122", 10), collapse=""), perl=TRUE)
[[1]]
[1]  1  5  9 13 17 21 25 29 33
attr(,"match.length")
[1] 4 4 4 4 4 4 4 4 4
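
Note that the lookahead version above reports the right start positions but a match.length of 4 rather than 8, because only the first half of each occurrence is actually consumed.  If the full overlapping text is wanted, one possibility is to put the whole pattern in a capturing group inside the lookahead and read the capture attributes instead.  This is only a sketch: it assumes a version of R whose gregexpr() returns the "capture.start" and "capture.length" attributes when perl=TRUE (the 2.8.0 in the report may not), and the variable names are just for illustration.

```r
# Sketch only: assumes gregexpr(..., perl = TRUE) attaches
# "capture.start" / "capture.length" attributes to its result.
x <- paste(rep("1122", 10), collapse = "")

# The lookahead itself consumes nothing, so every start position is tried;
# the capturing group inside it records the full 8-character occurrence.
m <- gregexpr("(?=(11221122))", x, perl = TRUE)[[1]]

starts <- as.integer(m)                          # where each occurrence begins
lens   <- as.integer(attr(m, "capture.length"))  # length of the captured text

# Pull out the overlapping occurrences themselves.
overlapping <- substring(x, starts, starts + lens - 1)
```

Here match.length is 0 for every match (the lookahead is zero-width), so the lengths have to come from the capture attributes rather than from match.length.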



--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of rthompso at aecom.yu.edu
> Sent: Friday, December 12, 2008 10:05 AM
> To: r-devel at stat.math.ethz.ch
> Cc: R-bugs at r-project.org
> Subject: [Rd] gregexpr - match overlap mishandled (PR#13391)
>
> Full_Name: Reid Thompson
> Version: 2.8.0 RC (2008-10-12 r46696)
> OS: darwin9.5.0
> Submission from: (NULL) (129.98.107.177)
>
>
> the gregexpr() function does NOT return a complete list of global
> matches as it should.  this occurs when a pattern matches two
> overlapping portions of a string, only the first match is returned.
>
> the following function call demonstrates this error (although this is
> not how I initially discovered the problem):
> gregexpr("11221122", paste(rep("1122", 10), collapse=""))
>
> instead of returning 9 matches as one would expect, only 5 matches are
> returned . . .
>
> [[1]]
> [1]  1  9 17 25 33
> attr(,"match.length")
> [1] 8 8 8 8 8
>
> you will note, essentially, that the entire first match is then
> excluded from subsequent matching
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


