[R] extracting a matched string using regexpr

Thu May 6 00:00:16 CEST 2010

That's not what I get:

> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> sub(".*(\\d{5}).*", "\\1", test)
[1] "88958"
> R.version.string
[1] "R version 2.10.1 (2009-12-14)"

I got the same result in R 2.11.0 patched.
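
For the original question further down (pulling out the matched text with
regexpr), here is a minimal base-R sketch that calls regexpr only once. It
is an illustration, not a definitive implementation: extract_first is a
made-up helper name, not something in base R or gsubfn, and it sticks to
[0-9] rather than \\d because that is the form that worked in your session.

```r
# Sample HTML fragment from this thread.
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# extract_first is a hypothetical helper (an assumption for illustration).
# It returns the first match of `pattern` in the single string `x`,
# or "" when there is no match.
extract_first <- function(pattern, x) {
  m <- regexpr(pattern, x)          # start of first match, -1 if none
  start <- as.integer(m)            # drop the attributes
  len <- attr(m, "match.length")    # length of the match, -1 if none
  substring(x, start, start + len - 1)
}

answer <- extract_first("[0-9]{5}", test)
print(answer)
# [1] "88958"
```

When nothing matches, regexpr returns -1 for both the start and the match
length, so substring is asked for an empty range and the helper returns "".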

On Wed, May 5, 2010 at 5:55 PM, steven mosher <moshersteven at gmail.com> wrote:
> > test
> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> > sub(".*(\\d{5}).*", "\\1", test)
> [1] "</th>"
> > sub(".*([0-9]{5}).*", "\\1", test)
> [1] "88958"
>
> I think the "</" in the source throws something off, as the group capture
> does not appear to work with \\d, although the bracket version [0-9] does.
>
> On Wed, May 5, 2010 at 2:35 PM, Gabor Grothendieck <ggrothendieck at gmail.com>
> wrote:
>>
>> Here are two ways to extract 5 digits.
>>
>> In the first one \\1 refers to the portion matched between the
>> parentheses in the regular expression.
>>
>> In the second one, strapply is like apply: the first argument is the
>> object to be worked on (an array for apply, a string for strapply), the
>> second specifies what to work over (which dimension for apply, a regular
>> expression for strapply), and the last is a function applied to each
>> value (typically each row or column for apply, each match for strapply).
>> Here we use c as the function so that all the matches are simply
>> returned. The results come back as a list with one component per input
>> string; test is a single string, so the list has length one and [[1]]
>> extracts the contents of its first component.
>>
>> # 1 - sub
>> sub(".*(\\d{5}).*", "\\1", test)
>>
>> # 2 - strapply - see http://gsubfn.googlecode.com
>> library(gsubfn)
>> strapply(test, "\\d{5}", c)[[1]]
>>
>>
>>
>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <moshersteven at gmail.com>
>> wrote:
>> > Given a text like
>> >
>> > I want to be able to extract the text matched by a regular expression
>> > from a piece of text.
>> >
>> > this apparently works, but is pretty ugly
>> > # some html
>> >
>> > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> > # a pattern to extract 5 digits
>> >> pattern<-"[0-9]{5}"
>> > # regexpr returns a start point [1] and an attribute "match.length",
>> > # i.e. attr(, "match.length")
>> > # get the substring from the start point to the stop point, where
>> > # stop = start + length - 1
>> >> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>> >> answer
>> > [1] "88958"
>> >
>> > I tried using sub(pattern, replacement, x) with a regexp that captured
>> > the group. I'd found an example of this in the mails, but it didn't
>> > seem to work.