[R] Regular expressions: offsets of groups

Tue Sep 28 17:46:34 CEST 2010

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Michael Bedward
> Sent: Tuesday, September 28, 2010 12:46 AM
> To: Titus von der Malsburg
> Cc: r-help at r-project.org
> Subject: Re: [R] Regular expressions: offsets of groups
> 
> What Titus wants to do is akin to retrieving capturing groups from a
> Matcher object in Java. I also thought there must be an existing,
> elegant solution to this some time ago and searched for it, including
> looking at the sources (albeit with not much expertise) but came up
> blank.
> 
> I also looked at the stringr package (which is nice) but it doesn't
> quite do it either.

S+ has a subpattern=number argument to regexpr and
related functions.  It means that the text matched
by the subpattern'th parenthesized expression in the
pattern will be considered the matched text.  E.g.,
to find runs of b's that come immediately after a's:

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=1)
  [[1]]:
  [1] 2 7
  attr(, "match.length"):
  [1] 1 2

or to find bc's that come after 2 or more ab's:

  > gregexpr("(ab){2,}bc", "abbcabababbcabcababbc", subpattern=1)

regexpr() and strsplit() have this argument in S+ 8.1 but
gregexpr() is not yet in a released version of S+.

subpattern=0, the default, means to use the entire
pattern.  regexpr allows subpattern=-1, which means
to return a list with one element for each subpattern.
I don't know if the extra complexity is worth it.
(gregexpr does not allow subpattern=-1.)

The usual C regexec() returns this information.
Perhaps it would be handy to have it in R.
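
As a rough illustration of what regexec() hands back, here is a minimal
sketch in C using the POSIX <regex.h> interface.  The helper name
group_offsets() and its argument layout are made up for this example and
are not part of any library; the 1-based starts are chosen only to
mirror the R/S+ output above.  It assumes POSIX extended regular
expressions and at most 9 parenthesized groups.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): scan `text` for every
 * match of `pattern` and record the 1-based start and the length of
 * capture group `group` (0 means the whole match).  A group that did
 * not take part in a match is recorded as start -1, length -1.
 * Returns the number of matches stored, or -1 on error. */
int group_offsets(const char *pattern, const char *text, size_t group,
                  int *starts, int *lengths, int max_out)
{
    regex_t re;
    regmatch_t pm[10];   /* pm[0] = whole match, pm[1..9] = groups */
    size_t pos = 0;      /* offset of the still-unsearched tail */
    int n = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (group >= 10 || group > re.re_nsub) {
        regfree(&re);
        return -1;
    }

    while (n < max_out && text[pos] != '\0' &&
           regexec(&re, text + pos, 10, pm, pos ? REG_NOTBOL : 0) == 0) {
        if (pm[group].rm_so < 0) {
            starts[n] = -1;
            lengths[n] = -1;
        } else {
            /* rm_so/rm_eo are relative to text + pos, so add pos back */
            starts[n] = (int)(pos + (size_t)pm[group].rm_so) + 1;
            lengths[n] = (int)(pm[group].rm_eo - pm[group].rm_so);
        }
        n++;

        /* Continue after the whole match; step one extra character
         * after an empty match so the loop always makes progress. */
        size_t step = (size_t)pm[0].rm_eo;
        if (pm[0].rm_eo == pm[0].rm_so) {
            if (text[pos + step] == '\0')
                break;
            step++;
        }
        pos += step;
    }

    regfree(&re);
    return n;
}
```

Called with the pattern "a+(b+)", the text "abcdaabbc" and group 1, this
stores starts 2 and 7 with lengths 1 and 2; with group 0 it stores
starts 1 and 5 with lengths 2 and 4, the same numbers gregexpr() reports
for the complete matches.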

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> Michael
> 
> On 28 September 2010 01:48, Titus von der Malsburg 
> <malsburg at gmail.com> wrote:
> > Dear list!
> >
> >> gregexpr("a+(b+)", "abcdaabbc")
> > [[1]]
> > [1] 1 5
> > attr(,"match.length")
> > [1] 2 4
> >
> > What I want is the offsets of the matches for the group (b+), i.e. 2
> > and 7, not the offsets of the complete matches.  Is there a way in R
> > to get that?
> >
> > I know about gsubfn and strapply, but they only give me the strings
> > matched by groups, not their offsets.
> >
> > I could write something myself that first takes the above matches
> > ("ab" and "aabb") and then searches again using only the group (b+).
> > For this to work, I'd have to parse the regular expression and search
> > several times (> 2, for nested groups) instead of just once.  But I'm
> > sure there is a better way to do this.
> >
> > Thanks for any suggestion!
> >
> >   Titus
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 