[R-SIG-Mac] Fwd: [R] extracting a matched string using regexpr Possible BUG
Simon Urbanek
simon.urbanek at r-project.org
Thu May 6 17:28:52 CEST 2010
FWIW I don't think \d is part of the basic regexp syntax, so I would expect the perl mode to work, and it does:
> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
[1] "12345"
Yet I agree that it should either fail (i.e. return the unmodified string) or return 12345.
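For anyone who wants to sidestep the question of what \d means in the default mode, here is a minimal sketch of the alternatives. test2 is the string from this thread; the variable names are only illustrative, and the sketch assumes the bracket expression and the POSIX class behave in your locale the way [0-9] did in the reports quoted further down.

```r
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

# Explicit bracket expression, default (non-perl) mode
via_range <- sub(".*([0-9]{5}).*", "\\1", test2)

# POSIX character class, default (non-perl) mode
via_class <- sub(".*([[:digit:]]{5}).*", "\\1", test2)

# \d, restricted to perl = TRUE, where it is a standard escape
via_perl <- sub(".*(\\d{5}).*", "\\1", test2, perl = TRUE)

print(c(range = via_range, class = via_class, perl = via_perl))
```

All three avoid relying on \d in the default mode, which is the only case in question here.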
Also note that the bug is locale-specific:
LANG=C R
> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
[1] "12345"
> sub(".*(\\d{5}).*", "\\1", test2)
[1] "12345"
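The locale comparison can also be tried without restarting R. What follows is only a sketch: with_ctype is a hypothetical helper (not part of R), and it assumes Sys.setlocale() is allowed to change LC_CTYPE on your platform. I have not checked that switching inside a session is equivalent to starting with LANG=C in every respect.

```r
# Hypothetical helper: evaluate an expression under a temporary LC_CTYPE,
# restoring the previous setting afterwards.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", locale)
  force(expr)  # the argument is lazy, so it is evaluated under the new locale
}

test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
before <- Sys.getlocale("LC_CTYPE")

# Default mode and perl mode, both under the C locale
in_c      <- with_ctype("C", sub(".*(\\d{5}).*", "\\1", test2))
in_c_perl <- with_ctype("C", sub(".*(\\d{5}).*", "\\1", test2, perl = TRUE))

print(c(default = in_c, perl = in_c_perl))
```

Running the same two calls without the helper, in the session's own locale, gives the pair to compare against.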
Also note that this is not Mac-specific:
> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
> sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"
> system("uname -sr")
Linux 2.6.32-trunk-amd64
> Sys.getlocale("LC_CTYPE")
[1] "en_US.UTF-8"
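Since the result depends on both platform and locale, it may help to collect those details together with the output when reporting. A rough sketch, with the list name and its fields chosen only for illustration:

```r
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
pattern <- ".*(\\d{5}).*"

# Gather the platform details alongside both results
report <- list(
  r_version = R.version.string,
  platform  = R.version$platform,
  lc_ctype  = Sys.getlocale("LC_CTYPE"),
  default   = sub(pattern, "\\1", test2),
  perl      = sub(pattern, "\\1", test2, perl = TRUE)
)
str(report)
```

If the default and perl entries differ, the lc_ctype entry is the first thing to look at.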
Cheers,
Simon
On May 6, 2010, at 6:54 AM, David Winsemius wrote:
>
> On May 6, 2010, at 2:21 AM, steven mosher wrote:
>
>> see below,
>>
>> using a regex in sub() fails if the pattern is //d{5} and succeeds
>> if the pattern [0-9]{5} is used. See the test cases below.
>>
>> The issue did not occur on a Windows machine; David and I both saw it on Mac.
>
> Except we both were using \\d rather than //d.
>
> I believe that Steve is using R 2.11.0 but I am still using R 2.10.1 (but with the release of an Hmisc upgrade I will convert soon.)
>
> --
> David.
>
> > sessionInfo()
> R version 2.10.1 RC (2009-12-09 r50695)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] tcltk stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] gsubfn_0.5-2 proto_0.3-8 zoo_1.6-3 SASxport_1.2.3 lattice_0.18-3
>
> loaded via a namespace (and not attached):
> [1] chron_2.3-35 grid_2.10.1 tools_2.10.1
>>
>> r11
>>
>> mac os 10.5
>>
>> ---------- Forwarded message ----------
>> From: steven mosher <moshersteven at gmail.com>
>> Date: Wed, May 5, 2010 at 3:25 PM
>> Subject: Re: [R] extracting a matched string using regexpr
>> To: David Winsemius <dwinsemius at comcast.net>
>> Cc: Gabor Grothendieck <ggrothendieck at gmail.com>, r-help <
>> r-help at r-project.org>
>>
>>
>> with a fresh restart
>>
>>
>>
>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>
>>> test
>> [1]
>> "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>> sub(".*(\\d{5}).*", "\\1", test)
>> [1] "</th>"
>>> sub(".*([0-9]{5}).*", "\\1", test)
>> [1] "88958"
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>>>
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>>> sub(".*([0-9]{5}).*", "\\1", test2)
>> [1] "12345"
>>
>>
>> Steve.
>>
>>
>>
>> On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsemius at comcast.net>wrote:
>>
>>>
>>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>>
>>>> Here are two ways to extract 5 digits.
>>>>
>>>> In the first one \\1 refers to the portion matched between the
>>>> parentheses in the regular expression.
>>>>
>>>> In the second one strapply is like apply where the object to be worked
>>>> on is the first argument (array for apply, string for strapply) the
>>>> second modifies it (which dimension for apply, regular expression for
>>>> strapply) and the last is a function which acts on each value
>>>> (typically each row or column for apply and each match for strapply).
>>>> In this case we use c as our function to just return all the results.
>>>> They are returned in a list with one component per string but here
>>>> test is just a single string so we get a list one long and we ask for
>>>> the contents of the first component using [[1]].
>>>>
>>>> # 1 - sub
>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>
>>>> test
>>> [1]
>>> "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>
>>> I get different results than I expected given that "\\d" should be
>>> synonymous with "[0-9]":
>>>
>>>
>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>> [1] "88958"
>>>
>>>> sub(".*(\\d{5}).*", "\\1", test)
>>> [1] "</th>"
>>>
>>> --
>>> David.
>>>
>>>>
>>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>>> library(gsubfn)
>>>> strapply(test, "\\d{5}", c)[[1]]
>>>>
>>>>
>>>>
>>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <moshersteven at gmail.com>
>>>> wrote:
>>>>
>>>>> Given a text like
>>>>>
>>>>> I want to be able to extract a matched regular expression from a piece of
>>>>> text.
>>>>>
>>>>> this apparently works, but is pretty ugly
>>>>> # some html
>>>>>
>>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>> # a pattern to extract 5 digits
>>>>>
>>>>>> pattern<-"[0-9]{5}"
>>>>>>
>>>>> # regexpr returns a start point[1] and an attribute "match.length"
>>>>> attr(,"match.length")
>>>>> # get the substring from the start point to the stop point.. where stop =
>>>>> start +length-1
>>>>>
>>>>>>
>>>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>>
>>>>>> answer
>>>>>>
>>>>> [1] "88958"
>>>>>
>>>>> I tried using sub(pattern, replacement, x) with a regexp that captured
>>>>> the group. I'd found an example of this in the mails,
>>>>> but it didn't seem to work.
>>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>>
>>
>>
>> _______________________________________________
>> R-SIG-Mac mailing list
>> R-SIG-Mac at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>
> David Winsemius, MD
> West Hartford, CT