[R] extracting a matched string using regexpr

David Winsemius dwinsemius at comcast.net
Thu May 6 00:20:20 CEST 2010


On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:

> Here are two ways to extract 5 digits.
>
> In the first one \\1 refers to the portion matched between the
> parentheses in the regular expression.
>
> In the second one, strapply is like apply: the object to be worked
> on is the first argument (array for apply, string for strapply), the
> second argument modifies it (which dimension for apply, regular
> expression for strapply), and the last is a function that acts on each
> value (typically each row or column for apply and each match for
> strapply). In this case we use c as our function to just return all
> the results. They are returned in a list with one component per
> string, but here test is just a single string, so we get a list of
> length one and ask for the contents of the first component using [[1]].
>
> # 1 - sub
> sub(".*(\\d{5}).*", "\\1", test)
 > test
[1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

I get different results than I expected given that "\\d" should be  
synonymous with "[0-9]":

 > sub(".*([0-9]{5}).*", "\\1", test)
[1] "88958"

 > sub(".*(\\d{5}).*", "\\1", test)
[1] "</th>"
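
One way to narrow this down is to run the same substitution with each spelling of the digit class side by side. This is only a minimal sketch, assuming the test string from the original post; the names patterns and results are illustrative, and [[:digit:]] is included purely as a third spelling for comparison. What \\d returns may depend on the R version and regex engine in use, so no output is shown.

```r
# The test string from the original post
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# Three spellings of "five digits", each wrapped in a capturing group
patterns <- c(bracket = ".*([0-9]{5}).*",
              posix   = ".*([[:digit:]]{5}).*",
              escape  = ".*(\\d{5}).*")

# Apply the same sub() call to each pattern and collect the results
results <- vapply(patterns, function(p) sub(p, "\\1", test), character(1))
print(results)
```

If the three entries disagree, the difference lies in how the engine treats the character class rather than in the rest of the pattern.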

-- 
David.
>
> # 2 - strapply - see http://gsubfn.googlecode.com
> library(gsubfn)
> strapply(test, "\\d{5}", c)[[1]]
>
>
>
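
For comparison, here is a rough base-R analogue of the strapply call. It is a sketch only, assuming a version of R that provides regmatches(), which was added after this thread was written. gregexpr() finds every match in each string and regmatches() returns them as a list with one component per string, which is the same shape described for strapply. The vector x and its second element are made up for illustration, and [0-9] is used instead of \\d to stay clear of the discrepancy reported in this thread.

```r
# Two strings: the one from the original post plus a made-up second one
x <- c("</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>",
       "<th>12345</th><th>67890</th>")

# gregexpr() locates all matches per string; regmatches() extracts them
all_matches <- regmatches(x, gregexpr("[0-9]{5}", x))

# One list component per input string; [[1]] gives the matches for the first
first <- all_matches[[1]]
print(all_matches)
```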
> On Wed, May 5, 2010 at 5:13 PM, steven mosher  
> <moshersteven at gmail.com> wrote:
>> Given a text like the HTML below, I want to extract the part of it that
>> matches a regular expression.
>>
>> This apparently works, but is pretty ugly:
>> # some html
>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> # a pattern to extract 5 digits
>>> pattern<-"[0-9]{5}"
>> # regexpr returns a start point [1] and an attribute "match.length":
>> # attr(, "match.length")
>> # get the substring from the start point to the stop point, where
>> # stop = start + length - 1
>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>> answer
>> [1] "88958"
>>
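
The three calls to regexpr above can be collapsed into one by saving the match object. A sketch, assuming the same test and pattern as in the post; m, start, len and answer2 are illustrative names, and regmatches() is only available in releases of R later than the one this thread was written against.

```r
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
pattern <- "[0-9]{5}"

# Call regexpr() once and keep the result
m <- regexpr(pattern, test)
start <- as.integer(m)            # position of the first match (-1 if none)
len <- attr(m, "match.length")    # length of that match

# Same arithmetic as the original: stop = start + length - 1
answer <- substr(test, start, start + len - 1)

# Later versions of R can do the extraction directly from the match object
answer2 <- regmatches(test, m)
```

Both answer and answer2 hold the matched text; when there is no match, regmatches() returns an empty character vector rather than a substring.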
>> I tried using sub(pattern, replacement, x) with a regexp that captured
>> the group. I'd found an example of this in the mails,
>> but it didn't seem to work.
>

David Winsemius, MD
West Hartford, CT


