[R] Regular expressions: offsets of groups

Michael Bedward michael.bedward at gmail.com
Wed Sep 29 13:58:56 CEST 2010


I'd definitely be a customer for it, Titus. And it does seem like an
obvious hole in regex processing in R that cries out to be filled.

Um, ggregexpr isn't the sexiest of function names :)  Perhaps we can
think of something a little easier ?

How is your C coding ? Bill ? Anyone else ?  I could have a go at
writing some prototype code to test in the next few days, though if
someone else with decent C skills is itching to do it please speak up.
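
To pin down the interface before anyone touches C, here is a rough
R-level sketch of the return format Titus describes below: one element
for the whole match (group 0) and one per group, each carrying a
"match.length" attribute. The function name group_offsets is made up
for illustration. The sketch assumes a version of R in which
gregexpr(..., perl = TRUE) returns "capture.start" and "capture.length"
attributes; where those are not available, this will not give group
positions and the C work is still needed.

```r
# Hypothetical helper, not part of base R: returns a list whose first
# element holds the start positions of the whole match and whose
# remaining elements hold the start positions of each capture group.
# Each element has a "match.length" attribute, as gregexpr does.
group_offsets <- function(pattern, text) {
  # Assumption: perl = TRUE makes gregexpr attach "capture.start" and
  # "capture.length" matrices (one row per match, one column per group).
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]

  # gregexpr signals "no match" with a single -1.
  if (m[1] == -1) return(list())

  # Column 1 is the whole match (group 0); later columns are the groups.
  starts  <- cbind(as.vector(m), attr(m, "capture.start"))
  lengths <- cbind(attr(m, "match.length"), attr(m, "capture.length"))

  lapply(seq_len(ncol(starts)), function(j) {
    structure(starts[, j], match.length = lengths[, j])
  })
}

res <- group_offsets("a+(b+)", "abcdaabbc")
print(res)
```

Returning a plain list indexed by group keeps the result shape the same
however many groups the pattern has, which avoids the changing return
structure Titus points out for a vector-valued subpattern argument.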

Michael

On 29 September 2010 20:08, Titus von der Malsburg <malsburg at gmail.com> wrote:
> Bill, Michael,
>
> good to see I'm not the only one who sees potential for improvements
> in the regexpr domain.  Adding a subpattern argument is certainly a
> step in the right direction and would make my life much easier.
> However, in my application I need to know not only the position of one
> group but also the position of the overall match in the original
> string.  The ideal solution would provide positions and match lengths
> for the whole pattern and for all groups if desired.  Only this would
> solve all related issues.  One possibility is to have a subpattern
> argument that accepts a vector of numbers (0 refers to the whole
> pattern):
>
>  > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
>  [[1]]:
>  [[1]][[1]]:
>  [1] 1 5
>  attr(, "match.length"):
>  [1] 2 4
>  [[1]][[2]]:
>  [1] 2 7
>  attr(, "match.length"):
>  [1] 1 2
>
> A weakness of this solution is that the structure of the return values
> changes if length(subpattern)>1.  An alternative is to have a separate
> function, say ggregexpr for group gregexpr, that returns a list of
> lists as in the above example.  This function would always return
> positions and match lengths of the whole pattern (group 0) and all
> groups.  The original gregexpr could still have the subpattern
> argument but it would only accept single numbers.  This way the return
> format of gregexpr remains the same.
>
> Best,
>
>  Titus
>
>
> On Wed, Sep 29, 2010 at 2:42 AM, Michael Bedward
> <michael.bedward at gmail.com> wrote:
>> Ah, that's interesting - thanks Bill. That's certainly on the right
>> track for me (Titus, you too ?) especially if the subpattern argument
>> accepted a vector of multiple group indices.
>>
>> As you say, this is straightforward in C. I'd be happy to (try to)
>> make a patch for the R sources if there was some consensus on the best
>> way to implement it, i.e. as a new R function or by extending existing
>> function(s).
>


