[R-SIG-Mac] Fwd: [R] extracting a matched string using regexpr Possible BUG

Thu May 6 17:50:22 CEST 2010

Two Q's:
A) Is this supposed to happen with perl-mode?:

 > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
 >
 > sub(".*(\\d{5}).*", "\\1", test, perl=TRUE)
[1] "88958\nW</th><th>26m</th>"
 >
 > sub(".*([0-9]{5}).*", "\\1", test, perl=TRUE)
[1] "88958\nW</th><th>26m</th>"

It looks to me as if the period (.) is being improperly recognized.
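
One possible reading, offered as a sketch rather than a confirmed diagnosis: with perl=TRUE the pattern is handled by PCRE, where by default "." does not match a newline. The trailing ".*" would then stop just before the "\n" in test, so everything from the newline onward is left unreplaced. The inline flag (?s) is standard PCRE syntax that lets "." match newlines as well. The variable names res_default and res_dotall are only illustrative, and the behaviour described assumes a current R build.

```r
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# Default PCRE behaviour: "." stops at the newline, so the tail of the
# string (from "\n" onward) is not part of the match and survives.
res_default <- sub(".*(\\d{5}).*", "\\1", test, perl = TRUE)

# With (?s) ("dotall"), "." also matches "\n", so the whole string is
# consumed and only the captured digits remain.
res_dotall <- sub("(?s).*(\\d{5}).*", "\\1", test, perl = TRUE)

print(res_default)
print(res_dotall)
```

If this reading is right, the perl-mode output above is expected behaviour, and only the default-mode results further down the thread would be the actual bug.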

On May 6, 2010, at 11:28 AM, Simon Urbanek wrote:

> FWIW I don't think \d is a basic regexp

B) With regard to the default (which I read to be extended rather than
basic) vs. perl-like, the Extended section of the regex documentation
contains:

"Symbols \d, \s, \D and \S denote the digit and space classes and
their negations."
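
As a sketch of a spelling that sidesteps the \d question altogether: the POSIX class [[:digit:]] is accepted by both the default (extended) engine and the perl engine. This has not been verified against the R 2.10.1 and 2.11.0 builds discussed in this thread, so treat it as an assumption about current R; res_ext and res_perl are illustrative names.

```r
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

# POSIX character class with the default (extended) engine
res_ext <- sub(".*([[:digit:]]{5}).*", "\\1", test2)

# Same pattern with the PCRE engine
res_perl <- sub(".*([[:digit:]]{5}).*", "\\1", test2, perl = TRUE)

print(c(res_ext, res_perl))
```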

> so as I would expect the perl mode to work and it does:
>
>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
> [1] "12345"
>
> Yet I agree that it should either fail (i.e. return the unmodified
> string) or return 12345.
>
> Also note that the bug is locale-specific:
>
> LANG=C R
>
>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
> [1] "12345"
>> sub(".*(\\d{5}).*", "\\1", test2)
> [1] "12345"
>
> Also note that this is not Mac-specific:
>
>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>> sub(".*(\\d{5}).*", "\\1", test2)
> [1] "WWWWW"
>> system("uname -sr")
> Linux 2.6.32-trunk-amd64
>> Sys.getlocale("LC_CTYPE")
> [1] "en_US.UTF-8"
>
>
> Cheers,
> Simon
>
>
>
> On May 6, 2010, at 6:54 AM, David Winsemius wrote:
>
>>
>> On May 6, 2010, at 2:21 AM, steven mosher wrote:
>>
>>> see below,
>>>
>>> using a regex in sub() fails if the pattern is //d{5} and succeeds
>>> if the pattern [0-9]{5} is used; see the test cases below.
>>>
>>> The issue did not occur on a Windows machine; David and I both saw it on a Mac.
>>
>> Except we both were using \\d rather than //d.
>>
>> I believe that Steve is using R 2.11.0 but I am still using R 2.10.1
>> (but with the release of an Hmisc upgrade I will convert soon.)
>>
>> -- 
>> David.
>>
>>> sessionInfo()
>> R version 2.10.1 RC (2009-12-09 r50695)
>> x86_64-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] tcltk     stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] gsubfn_0.5-2   proto_0.3-8    zoo_1.6-3      SASxport_1.2.3 lattice_0.18-3
>>
>> loaded via a namespace (and not attached):
>> [1] chron_2.3-35 grid_2.10.1  tools_2.10.1
>>>
>>> r11
>>>
>>> mac os 10.5
>>>
>>> ---------- Forwarded message ----------
>>> From: steven mosher <moshersteven at gmail.com>
>>> Date: Wed, May 5, 2010 at 3:25 PM
>>> Subject: Re: [R] extracting a matched string using regexpr
>>> To: David Winsemius <dwinsemius at comcast.net>
>>> Cc: Gabor Grothendieck <ggrothendieck at gmail.com>, r-help <r-help at r-project.org>
>>>
>>>
>>> with a fresh restart
>>>
>>>
>>>
>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>
>>>> test
>>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>> sub(".*(\\d{5}).*", "\\1", test)
>>> [1] "</th>"
>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>> [1] "88958"
>>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>>> sub(".*(\\d{5}).*", "\\1", test2)
>>> [1] "WWWWW"
>>>>
>>>> sub(".*(\\d{5}).*", "\\1", test2)
>>> [1] "WWWWW"
>>>> sub(".*([0-9]{5}).*", "\\1", test2)
>>> [1] "12345"
>>>
>>>
>>> Steve.
>>>
>>>
>>>
>>> On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>>>
>>>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>>>
>>>> Here are two ways to extract 5 digits.
>>>>>
>>>>> In the first one \\1 refers to the portion matched between the
>>>>> parentheses in the regular expression.
>>>>>
>>>>> In the second one strapply is like apply where the object to be worked
>>>>> on is the first argument (array for apply, string for strapply) the
>>>>> second modifies it (which dimension for apply, regular expression for
>>>>> strapply) and the last is a function which acts on each value
>>>>> (typically each row or column for apply and each match for strapply).
>>>>> In this case we use c as our function to just return all the results.
>>>>> They are returned in a list with one component per string but here
>>>>> test is just a single string so we get a list one long and we ask for
>>>>> the contents of the first component using [[1]].
>>>>>
>>>>> # 1 - sub
>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>>
>>>>> test
>>>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>
>>>> I get different results than I expected given that "\\d" should be
>>>> synonymous with "[0-9]":
>>>>
>>>>
>>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>>> [1] "88958"
>>>>
>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>> [1] "</th>"
>>>>
>>>> --
>>>> David.
>>>>
>>>>>
>>>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>>>> library(gsubfn)
>>>>> strapply(test, "\\d{5}", c)[[1]]
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <moshersteven at gmail.com> wrote:
>>>>>
>>>>>> Given a text like
>>>>>>
>>>>>> I want to be able to extract a matched regular expression from a
>>>>>> piece of text.
>>>>>>
>>>>>> this apparently works, but is pretty ugly
>>>>>> # some html
>>>>>>
>>>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>>> # a pattern to extract 5 digits
>>>>>>
>>>>>>> pattern<-"[0-9]{5}"
>>>>>>>
>>>>>> # regexpr returns a start point [1] and an attribute "match.length":
>>>>>> # attr(,"match.length")
>>>>>> # get the substring from the start point to the stop point,
>>>>>> # where stop = start + length - 1
>>>>>>
>>>>>>>
>>>>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>>>
>>>>>>> answer
>>>>>>>
>>>>>> [1] "88958"
>>>>>>
>>>>>> I tried using sub(pattern, replacement, x) with a regexp that captured
>>>>>> the group. I'd found an example of this in the mails
>>>>>> but it didn't seem to work.
>>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>> David Winsemius, MD
>>>> West Hartford, CT
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> R-SIG-Mac mailing list
>>> R-SIG-Mac at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>

David Winsemius, MD
West Hartford, CT