[R-SIG-Mac] Fwd: [R] extracting a matched string using regexpr Possible BUG

Thu May 6 12:54:36 CEST 2010

On May 6, 2010, at 2:21 AM, steven mosher wrote:

> see below,
>
> using a regex in sub() fails if the pattern is //d{5} and succeeds
> if the pattern [0-9]{5} is used; see the test cases below.
>
> The issue did not occur on a Windows machine; David and I both saw it on a Mac.

Except we both were using \\d rather than //d.
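
A minimal sketch of the comparison at issue, for anyone who wants to check their own build. The helper name extract_five is hypothetical (it is not from the thread); it only wraps the sub() call used in the test cases so that the two spellings of "digit" can be swapped in.

```r
# Hypothetical helper: build the thread's pattern around a given digit token
# and return what sub() produces for it.
extract_five <- function(x, digit) {
  pattern <- paste0(".*(", digit, "{5}).*")
  sub(pattern, "\\1", x)
}

test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

with_class  <- extract_five(test2, "[0-9]")  # bracket expression
with_escape <- extract_five(test2, "\\d")    # "\\d" in R source is the regex \d

# TRUE on a build where the two spellings agree; FALSE reproduces the report
identical(with_class, with_escape)
```

If the last line returns FALSE, printing with_class and with_escape side by side shows which of the two went wrong.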

I believe that Steve is using R 2.11.0 but I am still using R 2.10.1  
(but with the release of an Hmisc upgrade I will convert soon.)

-- 
David.

 > sessionInfo()
R version 2.10.1 RC (2009-12-09 r50695)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] gsubfn_0.5-2   proto_0.3-8    zoo_1.6-3      SASxport_1.2.3 lattice_0.18-3

loaded via a namespace (and not attached):
[1] chron_2.3-35 grid_2.10.1  tools_2.10.1
>
> r11
>
> mac os 10.5
>
> ---------- Forwarded message ----------
> From: steven mosher <moshersteven at gmail.com>
> Date: Wed, May 5, 2010 at 3:25 PM
> Subject: Re: [R] extracting a matched string using regexpr
> To: David Winsemius <dwinsemius at comcast.net>
> Cc: Gabor Grothendieck <ggrothendieck at gmail.com>, r-help <
> r-help at r-project.org>
>
>
> with a fresh restart
>
>
>
> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>
>> test
> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> sub(".*(\\d{5}).*", "\\1", test)
> [1] "</th>"
>> sub(".*([0-9]{5}).*", "\\1", test)
> [1] "88958"
>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>> sub(".*(\\d{5}).*", "\\1", test2)
> [1] "WWWWW"
>>
>> sub(".*(\\d{5}).*", "\\1", test2)
> [1] "WWWWW"
>> sub(".*([0-9]{5}).*", "\\1", test2)
> [1] "12345"
>
>
> Steve.
>
>
>
> On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
>>
>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>
>>> Here are two ways to extract 5 digits.
>>>
>>> In the first one, \\1 refers to the portion matched between the
>>> parentheses in the regular expression.
>>>
>>> In the second one, strapply is like apply: the object to be worked on
>>> is the first argument (an array for apply, a string for strapply), the
>>> second argument modifies it (which dimension for apply, the regular
>>> expression for strapply), and the last is a function that acts on each
>>> value (typically each row or column for apply, each match for
>>> strapply). In this case we use c as our function to just return all
>>> the results. They are returned in a list with one component per
>>> string, but here test is a single string, so we get a list of length
>>> one and ask for the contents of its first component using [[1]].
>>>
>>> # 1 - sub
>>> sub(".*(\\d{5}).*", "\\1", test)
>>>
>>> test
>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>
>> I get different results than I expected given that "\\d" should be
>> synonymous with "[0-9]":
>>
>>
>>> sub(".*([0-9]{5}).*", "\\1", test)
>> [1] "88958"
>>
>>> sub(".*(\\d{5}).*", "\\1", test)
>> [1] "</th>"
>>
>> --
>> David.
>>
>>>
>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>> library(gsubfn)
>>> strapply(test, "\\d{5}", c)[[1]]
>>>
>>>
>>>
>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <moshersteven at gmail.com> wrote:
>>>
>>>> Given some text, I want to be able to extract the part of it that
>>>> matches a regular expression.
>>>>
>>>> This apparently works, but is pretty ugly:
>>>> # some html
>>>>
>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>> # a pattern to extract 5 digits
>>>>
>>>>> pattern<-"[0-9]{5}"
>>>>>
>>>> # regexpr returns a start point [1] and an attribute "match.length",
>>>> # available as attr(, "match.length")
>>>> # get the substring from the start point to the stop point,
>>>> # where stop = start + length - 1
>>>>
>>>>>
>>>>> answer<-substr(test,
>>>>>                regexpr(pattern,test)[1],
>>>>>                regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>
>>>>> answer
>>>>>
>>>> [1] "88958"
>>>>
>>>> I tried using sub(pattern, replacement, x) with a regexp that captured
>>>> the group. I'd found an example of this in earlier mails, but it
>>>> didn't seem to work.
>>>>
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>

David Winsemius, MD
West Hartford, CT
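
For reference, the regexpr()/substr() extraction from Steve's original post (quoted at the bottom of the thread) can be sketched with a single regexpr() call instead of three. The variable names m and len and the NA fallback for "no match" are assumptions added here, not part of the original post. The last line uses regmatches(), which is only available in R versions newer than the ones discussed in this thread.

```r
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
pattern <- "[0-9]{5}"

m   <- regexpr(pattern, test)      # start of the first match, or -1 if none
len <- attr(m, "match.length")     # length of that match

# stop = start + length - 1, as in the original post; NA if nothing matched
answer <- if (m[1] > 0) substr(test, m[1], m[1] + len - 1) else NA_character_
answer

# Same extraction via regmatches() (newer R versions only)
regmatches(test, m)
```

Keeping the result of regexpr() in one variable avoids running the match three times and makes the no-match case explicit.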