[R-SIG-Mac] Fwd: [R] extracting a matched string using regexpr Possible BUG

Simon Urbanek simon.urbanek at r-project.org
Thu May 6 17:28:52 CEST 2010


FWIW I don't think \d is a basic regexp, so I would expect the perl mode to work, and it does:

> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
[1] "12345"

Yet I agree that it should either fail (i.e. return the unmodified string) or return 12345.
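
For anyone who wants to sidestep the question of how the default engine treats \d, here is a minimal sketch of three spellings that do not rely on it: an explicit [0-9] range, the POSIX class [[:digit:]], and \d with perl=TRUE. The result names (res_range, res_class, res_perl) are only for illustration and do not come from the thread.

```r
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

# Explicit bracket range: plain POSIX extended regexp syntax
res_range <- sub(".*([0-9]{5}).*", "\\1", test2)

# POSIX character class for digits, also valid in the default mode
res_class <- sub(".*([[:digit:]]{5}).*", "\\1", test2)

# \d is a Perl-style escape, so request the PCRE engine explicitly
res_perl <- sub(".*(\\d{5}).*", "\\1", test2, perl = TRUE)

print(c(range = res_range, class = res_class, perl = res_perl))
```

The first and third forms are the ones reported as returning "12345" in this thread; the [[:digit:]] form is added here as a further alternative and was not tested by anyone in the thread.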

Also note that the bug is locale-specific:

LANG=C R

>  test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
[1] "12345"
> sub(".*(\\d{5}).*", "\\1", test2)
[1] "12345"
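
A rough way to compare locales without restarting R is sketched below. It assumes that switching LC_CTYPE inside a running session exercises the same code path as starting R with LANG=C, which I have not verified. The helper name with_ctype is hypothetical.

```r
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
ctype_before <- Sys.getlocale("LC_CTYPE")

# Hypothetical helper: evaluate f() under a temporary LC_CTYPE setting,
# restoring the previous setting on exit (even if f() fails).
with_ctype <- function(locale, f) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", locale)
  f()
}

# Same default-mode call as above, once per locale
in_current <- sub(".*(\\d{5}).*", "\\1", test2)
in_c <- with_ctype("C", function() sub(".*(\\d{5}).*", "\\1", test2))

print(c(current = in_current, C = in_c))
print(Sys.getlocale("LC_CTYPE"))  # should match ctype_before again
```

If the two results differ, that points at the locale-dependent matching path rather than at the pattern itself.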

Also note that this is not Mac-specific:

>  test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>  sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"
> system("uname -sr")
Linux 2.6.32-trunk-amd64
> Sys.getlocale("LC_CTYPE")
[1] "en_US.UTF-8"
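
To make reports from different machines easier to compare, the pieces shown above (system, LC_CTYPE, and the result of each matching mode) can be gathered in one call. This is only a sketch; regex_report is a made-up name and the choice of fields is mine.

```r
# Hypothetical helper: collect platform, locale and both sub() results
regex_report <- function(x, pattern = ".*(\\d{5}).*") {
  si <- Sys.info()
  list(
    r_version = R.version.string,
    system    = paste(si[["sysname"]], si[["release"]]),
    lc_ctype  = Sys.getlocale("LC_CTYPE"),
    default   = sub(pattern, "\\1", x),               # default regex engine
    perl      = sub(pattern, "\\1", x, perl = TRUE)   # PCRE engine
  )
}

test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
report <- regex_report(test2)
str(report)
```

Pasting the str() output alongside sessionInfo() should be enough for someone else to tell whether they are looking at the same combination of platform and locale.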


Cheers,
Simon



On May 6, 2010, at 6:54 AM, David Winsemius wrote:

> 
> On May 6, 2010, at 2:21 AM, steven mosher wrote:
> 
>> see below,
>> 
>> using a regex in sub() fails if the pattern is //d{5} and succeeds
>> if the pattern [0-9]{5} is used; see the test cases below.
>> 
>> The issue did not appear on a Windows machine; David and I both had it on a Mac.
> 
> Except we both were using \\d rather than //d.
> 
> I believe that Steve is using R 2.11.0 but I am still using R 2.10.1 (but with the release of an Hmisc upgrade I will convert soon.)
> 
> -- 
> David.
> 
> > sessionInfo()
> R version 2.10.1 RC (2009-12-09 r50695)
> x86_64-apple-darwin9.8.0
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] tcltk     stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] gsubfn_0.5-2   proto_0.3-8    zoo_1.6-3      SASxport_1.2.3 lattice_0.18-3
> 
> loaded via a namespace (and not attached):
> [1] chron_2.3-35 grid_2.10.1  tools_2.10.1
>> 
>> r11
>> 
>> mac os 10.5
>> 
>> ---------- Forwarded message ----------
>> From: steven mosher <moshersteven at gmail.com>
>> Date: Wed, May 5, 2010 at 3:25 PM
>> Subject: Re: [R] extracting a matched string using regexpr
>> To: David Winsemius <dwinsemius at comcast.net>
>> Cc: Gabor Grothendieck <ggrothendieck at gmail.com>, r-help <
>> r-help at r-project.org>
>> 
>> 
>> with a fresh restart
>> 
>> 
>> 
>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>> 
>>> test
>> [1]
>> "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>> sub(".*(\\d{5}).*", "\\1", test)
>> [1] "</th>"
>>> sub(".*([0-9]{5}).*", "\\1", test)
>> [1] "88958"
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>>> 
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>>> sub(".*([0-9]{5}).*", "\\1", test2)
>> [1] "12345"
>> 
>> 
>> Steve.
>> 
>> 
>> 
>> On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsemius at comcast.net>wrote:
>> 
>>> 
>>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>> 
>>> Here are two ways to extract 5 digits.
>>>> 
>>>> In the first one \\1 refers to the portion matched between the
>>>> parentheses in the regular expression.
>>>> 
>>>> In the second one strapply is like apply where the object to be worked
>>>> on is the first argument (array for apply, string for strapply) the
>>>> second modifies it (which dimension for apply, regular expression for
>>>> strapply) and the last is a function which acts on each value
>>>> (typically each row or column for apply and each match for strapply).
>>>> In this case we use c as our function to just return all the results.
>>>> They are returned in a list with one component per string but here
>>>> test is just a single string so we get a list one long and we ask for
>>>> the contents of the first component using [[1]].
>>>> 
>>>> # 1 - sub
>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>> 
>>>> test
>>> [1]
>>> "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>> 
>>> I get different results than I expected given that "\\d" should be
>>> synonymous with "[0-9]":
>>> 
>>> 
>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>> [1] "88958"
>>> 
>>>> sub(".*(\\d{5}).*", "\\1", test)
>>> [1] "</th>"
>>> 
>>> --
>>> David.
>>> 
>>>> 
>>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>>> library(gsubfn)
>>>> strapply(test, "\\d{5}", c)[[1]]
>>>> 
>>>> 
>>>> 
>>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <moshersteven at gmail.com>
>>>> wrote:
>>>> 
>>>>> Given a text like
>>>>> 
>>>>> I want to be able to extract a matched regular expression from a piece of
>>>>> text.
>>>>> 
>>>>> this apparently works, but is pretty ugly
>>>>> # some html
>>>>> 
>>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>> # a pattern to extract 5 digits
>>>>> 
>>>>>> pattern<-"[0-9]{5}"
>>>>>> 
>>>>> # regexpr returns a start point[1] and an attribute "match.length"
>>>>> attr(,"match.length")
>>>>> # get the substring from the start point to the stop point.. where stop =
>>>>> start +length-1
>>>>> 
>>>>>> 
>>>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>> 
>>>>>> answer
>>>>>> 
>>>>> [1] "88958"
>>>>> 
>>>>> I tried using sub(pattern, replacement, x )  with a regexp that captured
>>>>> the
>>>>> group. I'd found an example of this in the mails
>>>>> but it didn't seem to work.
>>>>> 
>>>> 
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>> 
>>> 
>>> David Winsemius, MD
>>> West Hartford, CT
>>> 
>>> 
>> 
>> 
>> _______________________________________________
>> R-SIG-Mac mailing list
>> R-SIG-Mac at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
> 
> David Winsemius, MD
> West Hartford, CT
> 
> 
> 


