[Rd] gregexpr - match overlap mishandled (PR#13391)

Greg.Snow at imail.org Greg.Snow at imail.org
Fri Dec 12 22:35:26 CET 2008


Where do you get "should" and "expect" from?  All the regular expression
tools that I am familiar with only match non-overlapping patterns unless
you do extra to specify otherwise.  One of the standard references for
regular expressions if you really want to understand what is going on is
"Mastering Regular Expressions" by Jeffrey Friedl.  You should really read
through that book before passing judgment on the correctness of an
implementation.

If you want the overlaps, you need to come up with a regular expression
that will match without consuming all of the string.  Here is one way to
do it with your example:

 > gregexpr("1122(?=1122)", paste(rep("1122", 10), collapse=""), perl=TRUE)
[[1]]
[1]  1  5  9 13 17 21 25 29 33
attr(,"match.length")
[1] 4 4 4 4 4 4 4 4 4
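
The lookahead (?=1122) checks that a second "1122" follows but does not
consume it, so each reported match is only 4 characters long and the next
search can begin inside what would have been the 8-character match.  As a
minimal sketch (assuming an R build where perl=TRUE is available; the
variable names x, m, starts and hits are only illustrative), the full
overlapping occurrences can be recovered from the reported start positions:

```r
# Build the same 40-character test string used in the report.
x <- paste(rep("1122", 10), collapse = "")

# The lookahead asserts a following "1122" without consuming it,
# so matching resumes right after the first 4 characters.
m <- gregexpr("1122(?=1122)", x, perl = TRUE)[[1]]
starts <- as.integer(m)

# Each start marks an occurrence of the full 8-character pattern;
# extract 8 characters from each start to see the overlapping hits.
hits <- substring(x, starts, starts + 7)

print(starts)
print(hits)
```

Every element of hits should be "11221122", one for each of the 9 start
positions listed in the output above.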



--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-
> project.org] On Behalf Of rthompso at aecom.yu.edu
> Sent: Friday, December 12, 2008 10:05 AM
> To: r-devel at stat.math.ethz.ch
> Cc: R-bugs at r-project.org
> Subject: [Rd] gregexpr - match overlap mishandled (PR#13391)
>
> Full_Name: Reid Thompson
> Version: 2.8.0 RC (2008-10-12 r46696)
> OS: darwin9.5.0
> Submission from: (NULL) (129.98.107.177)
>
>
> the gregexpr() function does NOT return a complete list of global
> matches as it should.  this occurs when a pattern matches two
> overlapping portions of a string, only the first match is returned.
>
> the following function call demonstrates this error (although this is
> not how I initially discovered the problem):
> gregexpr("11221122", paste(rep("1122", 10), collapse=""))
>
> instead of returning 9 matches as one would expect, only 5 matches
> are returned . . .
>
> [[1]]
> [1]  1  9 17 25 33
> attr(,"match.length")
> [1] 8 8 8 8 8
>
> you will note, essentially, that the entire first match is then
> excluded from subsequent matching
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


