[R] extracting a matched string using regexpr

Gabor Grothendieck ggrothendieck at gmail.com
Thu May 6 04:31:19 CEST 2010


Yes, you could bring it up on the R-sig-mac list or file a bug report.
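
For such a report, a small self-contained script makes the problem easy for
others to check. The following is only a sketch, assuming the test string and
the two patterns used further down in this thread; the variable names res_d
and res_bracket are made up for illustration. It runs both patterns and
prints the results next to the R version so that they can be compared across
platforms.

```r
# Test string from this thread (note the embedded newline after "68.9").
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# Two patterns that are expected to behave alike for ASCII digits.
res_d       <- sub(".*(\\d{5}).*", "\\1", test)    # \\d version
res_bracket <- sub(".*([0-9]{5}).*", "\\1", test)  # bracket version

# Show both results and whether they agree, plus the R version in use.
cat("\\d result:     ", res_d, "\n")
cat("[0-9] result:   ", res_bracket, "\n")
cat("results agree:  ", identical(res_d, res_bracket), "\n")
cat("R version:      ", R.version.string, "\n")
```

If the two results differ on one platform but not on another, that output
together with the platform details is probably what a report needs.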

On Wed, May 5, 2010 at 10:11 PM, steven mosher <moshersteven at gmail.com> wrote:
> Thanks,
> perhaps we should report it.
>
> On Wed, May 5, 2010 at 4:52 PM, Gabor Grothendieck <ggrothendieck at gmail.com>
> wrote:
>>
>> I am using Vista.  Another thing to try is strapply using the tcl
>> engine (assuming you do have tcltk capabilities) and the R engine.  On
>> Vista R 2.11.0 patched I get the same result:
>>
>> > capabilities()[["tcltk"]]
>> [1] TRUE
>> > strapply(test, "\\d{5}", c, engine = "tcl")[[1]]
>> [1] "88958"
>> > strapply(test, "\\d{5}", c, engine = "R")[[1]]
>> [1] "88958"
>>
>> On Vista with R 2.9.2 I do get bad results:
>>
>> > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> > sub(".*(\\d{5}).*", "\\1", test)
>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> > sub(".*(\\d{5}).*", "\\1", test, extended = TRUE)
>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> > R.version.string
>> [1] "R version 2.9.2 Patched (2009-09-08 r49647)"
>> > win.version()
>> [1] "Windows Vista (build 6002) Service Pack 2"
>>
>>
>> On Wed, May 5, 2010 at 6:20 PM, steven mosher <moshersteven at gmail.com>
>> wrote:
>> > Hmm.
>> > I have R 2.11, freshly downloaded.
>> > I'll start a new session and report back. I will note that I've had
>> > trouble with \\d, which is why I was using [0-9].
>> > I'm on a Mac here.
>> >
>> > On Wed, May 5, 2010 at 3:00 PM, Gabor Grothendieck
>> > <ggrothendieck at gmail.com>
>> > wrote:
>> >>
>> >> That's not what I get:
>> >>
>> >> > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> >> > sub(".*(\\d{5}).*", "\\1", test)
>> >> [1] "88958"
>> >> > R.version.string
>> >> [1] "R version 2.10.1 (2009-12-14)"
>> >>
>> >> I also got the above in R 2.11.0 patched as well.
>> >>
>> >>
>> >> On Wed, May 5, 2010 at 5:55 PM, steven mosher <moshersteven at gmail.com>
>> >> wrote:
>> >> > > test
>> >> > [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> >> > > sub(".*(\\d{5}).*", "\\1", test)
>> >> > [1] "</th>"
>> >> > > sub(".*([0-9]{5}).*","\\1",test)
>> >> > [1] "88958"
>> >> >
>> >> >
>> >> > I think the "</" in the source throws something off, as the group
>> >> > capture appears not to work with \\d, although it did work with the
>> >> > bracket version.
>> >> >
>> >> > On Wed, May 5, 2010 at 2:35 PM, Gabor Grothendieck
>> >> > <ggrothendieck at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Here are two ways to extract 5 digits.
>> >> >>
>> >> >> In the first one \\1 refers to the portion matched between the
>> >> >> parentheses in the regular expression.
>> >> >>
>> >> >> In the second one, strapply is like apply: the first argument is the
>> >> >> object to be worked on (an array for apply, a string for strapply),
>> >> >> the second modifies it (which dimension for apply, a regular
>> >> >> expression for strapply), and the last is a function that acts on
>> >> >> each value (typically each row or column for apply and each match
>> >> >> for strapply). In this case we use c as our function to just return
>> >> >> all the results. They are returned in a list with one component per
>> >> >> string, but here test is just a single string, so we get a list of
>> >> >> length one and ask for the contents of its first component using
>> >> >> [[1]].
>> >> >>
>> >> >> # 1 - sub
>> >> >> sub(".*(\\d{5}).*", "\\1", test)
>> >> >>
>> >> >> # 2 - strapply - see http://gsubfn.googlecode.com
>> >> >> library(gsubfn)
>> >> >> strapply(test, "\\d{5}", c)[[1]]
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Wed, May 5, 2010 at 5:13 PM, steven mosher
>> >> >> <moshersteven at gmail.com>
>> >> >> wrote:
>> >> >> > Given a piece of text, I want to be able to extract the part of
>> >> >> > it that matches a regular expression.
>> >> >> >
>> >> >> > This apparently works, but is pretty ugly:
>> >> >> >
>> >> >> > # some html
>> >> >> > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> >> >> > # a pattern to extract 5 digits
>> >> >> > pattern<-"[0-9]{5}"
>> >> >> > # regexpr returns a start point [1] and an attribute, attr(,"match.length")
>> >> >> > # get the substring from the start point to the stop point, where
>> >> >> > # stop = start + length - 1
>> >> >> > answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>> >> >> > > answer
>> >> >> > [1] "88958"
>> >> >> >
>> >> >> > I tried using sub(pattern, replacement, x) with a regexp that
>> >> >> > captured the group. I'd found an example of this in the mails,
>> >> >> > but it didn't seem to work.
>> >> >
>> >> >
>> >
>> >
>
>
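
Regarding the regexpr approach in the original question at the bottom of the
thread: it calls regexpr three times on the same input. A minimal sketch of
the same idea that calls it once and reuses the result is given below. The
names m, start and len are chosen here for illustration, and the check for a
failed match is an addition that was not in the original. The last line uses
regmatches, which, as far as I know, only exists in versions of R later than
the ones discussed in this thread.

```r
# Test string and pattern from the original question.
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
pattern <- "[0-9]{5}"

# Run the match once; m holds the start position and carries the
# "match.length" attribute.
m <- regexpr(pattern, test)
start <- m[1]
len <- attr(m, "match.length")

# regexpr returns -1 when nothing matches, so guard against that case.
answer <- if (start > 0) substr(test, start, start + len - 1) else NA_character_
print(answer)

# Assumes a version of R that provides regmatches: it extracts the matched
# substring directly from the regexpr result.
answer2 <- regmatches(test, m)
print(answer2)
```

Storing the regexpr result also keeps start and length consistent with each
other, since both come from the same match.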


