[R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Sat Dec 13 18:03:17 CET 2008

Wacek,

 I am curious as to why Brian and I (and possibly other responders) are held to a higher standard than the original poster.

My first question was a real question.  There are 2 main ways to do regular expression matching (possibly others as well), you describe one method with the pointer moving through the string to be matched, the other way moves the pointer through the pattern looking for all possible matches in the string that match so far (one method is DFA the other NFA, but I don't remember which is which and my copy of Friedl is at work and I'm not).  Perl and PCRE use the method that you describe, but the other method may be more efficient for finding all overlapping matches in some cases, but as far as I know (and there are plenty of things I don't know, including all the changes/new programs since I last read about this) the only programs that use the other type of matching don't return the actual matches, just a yes/no on at least one match.  So if the original poster has formed his opinion based on another program that uses that type of matching, I would be interested to know about it.  It would teach me something and also make it easier for all of us to better help the original poster in the future if we understand where he is coming from.

I did consider giving more of an explanation, but I assumed that the original poster was intelligent enough to learn the details himself given a few pointers and also figured that while learning about his specific problem, he may learn something else as well.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: Wacek Kusnierczyk [mailto:Waclaw.Marcin.Kusnierczyk at idi.ntnu.no]
> Sent: Friday, December 12, 2008 4:08 PM
> To: Greg Snow
> Cc: R help
> Subject: Re: [Rd] gregexpr - match overlap mishandled (PR#13391)
>
> Greg Snow wrote:
> > Where do you get "should" and "expect" from?  All the regular
> expression tools that I am familiar with only match non-overlapping
> patterns unless you do extra to specify otherwise.  One of the standard
> references for regular expressions if you really want to understand
> what is going on is "Mastering Regular Expressions" by Jeffrey Friedl.
> You should really read through that book before passing judgment on the
> correctness of an implementation.
> >
> > If you want the overlaps, you need to come up with a regular
> expression that will match without consuming all of the string.  Here
> is one way to do it with your example:
> >
> >  > gregexpr("1122(?=1122)", paste(rep("1122", 10), collapse=""),
> perl=TRUE)
> > [[1]]
> > [1]  1  5  9 13 17 21 25 29 33
> > attr(,"match.length")
> > [1] 4 4 4 4 4 4 4 4 4
> >
> >
>
> another option would be to move the anchor backwards after each match,
> but i'm not sure if the problem really needs it and if it could be done
> from within r.
>
> greg (and another person who answered this post earlier):
> while your frustration is understandable, i think reid (and possibly
> other users as well) would benefit from a brief explanation instead of
> your emotional reactions.  you ought to be more patient and less
> arrogant with newbies who will often think there is a bug in r when
> there isn't.
>
> reid:
> when matching is performed, there is a pointer moved through the
> string.  in global matching, after a match is found the pointer is just
> behind the matched substring, and further matching proceeds from there.
> for example example, suppose you match "aaa" (the string) with "aa"
> (the
> pattern) globally.  after the first successful match, the position
> pointer is *behind the second a* in the string, and no further match
> can
> be found from there.in this context, 'global' does not mean that all
> possible matches are found, rather that matching is performed
> iteratively.
>
> the above is probably a solution to your problem, though the matches
> have length 4, not 8.  in perl, you could manually move back the anchor
> after each match, e.g.:
>
> $string = "1122" x 10;
> $n = length($string)/2;
> @matches = ();
> $string =~ /11221122(??{push @matches [$-[0], $&]; pos($s) -= $n})/g;
>
> now @matches has 9 elements, each a ref to an array with the starting
> position and the content (of length 8) of the respective match:
> @matches = ([0, "11221122"], [4, "11221122"], ...)
>
> not sure if you can do this within r.  not sure if you'll ever need it.
> for more complex cases when you need overlapping matches and you need
> their content, greg's solution might not do, but in general that's the
> solution.
>
> vQ