[R] Regular expressions: offsets of groups
Michael Bedward
michael.bedward at gmail.com
Wed Sep 29 13:58:56 CEST 2010
I'd definitely be a customer for it, Titus. And it does seem like an
obvious hole in regex processing in R that cries out to be filled.
Um, ggregexpr isn't the sexiest of function names :) Perhaps we can
think of something a little easier ?
How is your C coding ? Bill ? Anyone else ? I could have a go at
writing some prototype code to test in the next few days, though if
someone else with decent C skills is itching to do it please speak up.
Michael
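For concreteness, here is a minimal C sketch of the kind of prototype
discussed above. It is only an illustration: it assumes POSIX
regcomp/regexec from <regex.h> rather than the regex engine in the R
sources, and group_offsets, group_match and MAX_GROUPS are hypothetical
names, not existing R or C API. For every match in the string it records
the 1-based start and the length of the whole match (group 0) and of
each group.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 9   /* hypothetical limit chosen for this sketch */

/* Positions for one match: index 0 is the whole match, 1..n the groups. */
typedef struct {
    int start[MAX_GROUPS + 1];    /* 1-based start, -1 if the group did not match */
    int length[MAX_GROUPS + 1];   /* match length, -1 if the group did not match */
} group_match;

/* Hypothetical helper: finds successive matches of `pattern` in `text` and
   fills `out` with offsets for group 0 and groups 1..ngroups.
   Returns the number of matches stored, or -1 if the pattern is invalid. */
int group_offsets(const char *pattern, const char *text, int ngroups,
                  group_match *out, int max_matches)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS + 1];
    size_t pos = 0;
    int count = 0;

    assert(out != NULL && ngroups >= 0 && ngroups <= MAX_GROUPS);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* After the first match we search a suffix, so tell regexec that the
       start of that suffix is not the beginning of the line. */
    while (count < max_matches &&
           regexec(&re, text + pos, (size_t)ngroups + 1, pm,
                   pos > 0 ? REG_NOTBOL : 0) == 0) {
        for (int g = 0; g <= ngroups; g++) {
            if (pm[g].rm_so < 0) {
                out[count].start[g] = -1;
                out[count].length[g] = -1;
            } else {
                /* regexec offsets are 0-based and relative to text + pos */
                out[count].start[g] = (int)(pos + (size_t)pm[g].rm_so) + 1;
                out[count].length[g] = (int)(pm[g].rm_eo - pm[g].rm_so);
            }
        }
        count++;

        /* Continue after this match; step one character on an empty match
           so the loop always makes progress. */
        size_t end = pos + (size_t)pm[0].rm_eo;
        if (pm[0].rm_eo == pm[0].rm_so) {
            if (text[end] == '\0')
                break;
            end++;
        }
        pos = end;
    }

    regfree(&re);
    return count;
}
```

Called with the pattern "a+(b+)", the string "abcdaabbc" and ngroups = 1,
this finds two matches: group 0 at positions 1 and 5 with lengths 2 and 4,
and group 1 at positions 2 and 7 with lengths 1 and 2, the same numbers as
in Titus's example quoted below. Returning group 0 together with the groups
in one structure is the "list of lists" layout; whether that belongs in a
new function or behind a subpattern argument is the open design question.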
On 29 September 2010 20:08, Titus von der Malsburg <malsburg at gmail.com> wrote:
> Bill, Michael,
>
> good to see I'm not the only one who sees potential for improvements
> in the regexpr domain. Adding a subpattern argument is certainly a
> step in the right direction and would make my life much easier.
> However, in my application I need to know not only the position of one
> group but also the position of the overall match in the original
> string. The ideal solution would provide positions and match lengths
> for the whole pattern and for all groups if desired. Only this would
> solve all related issues. One possibility is to have a subpattern
> argument that accepts a vector of numbers (0 refers to the whole
> pattern):
>
> > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
> [[1]]:
> [[1]][[1]]:
> [1] 1 5
> attr(, "match.length"):
> [1] 2 4
> [[1]][[2]]:
> [1] 2 7
> attr(, "match.length"):
> [1] 1 2
>
> A weakness of this solution is that the structure of the return values
> changes if length(subpattern)>1. An alternative is to have a separate
> function, say ggregexpr for group gregexpr, that returns a list of
> lists as in the above example. This function would always return
> positions and match lengths of the whole pattern (group 0) and all
> groups. The original gregexpr could still have the subpattern
> argument but it would only accept single numbers. This way the return
> format of gregexpr remains the same.
>
> Best,
>
> Titus
>
>
> On Wed, Sep 29, 2010 at 2:42 AM, Michael Bedward
> <michael.bedward at gmail.com> wrote:
>> Ah, that's interesting - thanks Bill. That's certainly on the right
>> track for me (Titus, you too ?) especially if the subpattern argument
>> accepted a vector of multiple group indices.
>>
>> As you say, this is straightforward in C. I'd be happy to (try to)
>> make a patch for the R sources if there was some consensus on the best
>> way to implement it, i.e. as a new R function or by extending existing
>> function(s).
>