[R] Regular expressions: offsets of groups

Titus von der Malsburg malsburg at gmail.com
Wed Sep 29 12:08:49 CEST 2010


Bill, Michael,

good to see I'm not the only one who sees potential for improvements
in the regexpr domain.  Adding a subpattern argument is certainly a
step in the right direction and would make my life much easier.
However, in my application I need to know not only the position of one
group but also the position of the overall match in the original
string.  The ideal solution would provide positions and match lengths
for the whole pattern and for all groups if desired.  Only this would
solve all related issues.  One possibility is to have a subpattern
argument that accepts a vector of numbers (0 refers to the whole
pattern):

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
 [[1]]:
 [[1]][[1]]:
 [1] 1 5
 attr(, "match.length"):
 [1] 2 4
 [[1]][[2]]:
 [1] 2 7
 attr(, "match.length"):
 [1] 1 2

A weakness of this solution is that the structure of the return values
changes if length(subpattern)>1.  An alternative is to have a separate
function, say ggregexpr (for "group gregexpr"), that returns a list of
lists as in the above example.  This function would always return
positions and match lengths of the whole pattern (group 0) and all
groups.  The original gregexpr could still have the subpattern
argument but it would only accept single numbers.  This way the return
format of gregexpr remains the same.
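
To make the second proposal concrete, here is a rough R-level sketch of
what such a function might look like.  It is only an illustration under
assumptions: ggregexpr is the hypothetical name suggested above, not an
existing function, and the sketch relies on gregexpr(..., perl=TRUE)
returning "capture.start" and "capture.length" attributes, which
requires an R version that provides them.

```r
# Hypothetical helper, not part of base R: for each input string, return a
# list whose first element describes the whole match (group 0) and whose
# following elements describe each capture group, all in gregexpr style.
# Assumes gregexpr(perl = TRUE) provides "capture.start"/"capture.length".
ggregexpr <- function(pattern, text) {
  lapply(gregexpr(pattern, text, perl = TRUE), function(m) {
    # Group 0: start positions and lengths of the overall matches.
    whole <- structure(as.integer(m),
                       match.length = as.integer(attr(m, "match.length")))
    starts  <- attr(m, "capture.start")
    lengths <- attr(m, "capture.length")
    # Number of capture groups; zero if the pattern has none.
    n.groups <- if (is.null(starts)) 0L else ncol(starts)
    groups <- lapply(seq_len(n.groups), function(g) {
      structure(as.integer(starts[, g]),
                match.length = as.integer(lengths[, g]))
    })
    c(list(whole), groups)
  })
}

res <- ggregexpr("a+(b+)", "abcdaabbc")
```

With the example from above, res[[1]][[1]] holds the starts 1 and 5 with
match lengths 2 and 4, and res[[1]][[2]] holds the starts 2 and 7 with
match lengths 1 and 2.  Strings without a match yield -1 for every
group, as gregexpr does for the whole pattern.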

Best,

  Titus


On Wed, Sep 29, 2010 at 2:42 AM, Michael Bedward
<michael.bedward at gmail.com> wrote:
> Ah, that's interesting - thanks Bill. That's certainly on the right
> track for me (Titus, you too ?) especially if the subpattern argument
> accepted a vector of multiple group indices.
>
> As you say, this is straightforward in C. I'd be happy to (try to)
> make a patch for the R sources if there was some consensus on the best
> way to implement it, ie. as a new R function or by extending existing
> function(s).


