[R-SIG-Mac] Fwd: [R] extracting a matched string using regexpr Possible BUG
Simon Urbanek
simon.urbanek at r-project.org
Thu May 6 18:10:45 CEST 2010
On May 6, 2010, at 11:50 AM, David Winsemius wrote:
> Two Q's:
> A) Is this supposed to happen with perl-mode?:
>
> > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> >
> > sub(".*(\\d{5}).*", "\\1", test, perl=TRUE)
> [1] "88958\nW</th><th>26m</th>"
> >
> > sub(".*([0-9]{5}).*", "\\1", test, perl=TRUE)
> [1] "88958\nW</th><th>26m</th>"
>
Nope - perl does take EOL into account, so .* is matched only up to the end of the line. For your purposes you want to enable the (?s) option, so you probably meant:
> sub("(?s).*(\\d{5}).*", "\\1", test, perl=TRUE)
[1] "88958"
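To make the difference concrete, here is a minimal sketch comparing perl mode with and without the (?s) flag on the same test string. The variable names with_default and with_dotall are placeholders chosen for this illustration only.

```r
# Same HTML fragment as in the thread; note the embedded newline in "68.9\nW".
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# Without (?s): in perl mode "." does not match "\n", so the match stops at
# the end of the first line and the text after the newline is left in place.
with_default <- sub(".*(\\d{5}).*", "\\1", test, perl = TRUE)

# With (?s): "." also matches "\n", so the whole string is consumed and only
# the captured digits remain.
with_dotall <- sub("(?s).*(\\d{5}).*", "\\1", test, perl = TRUE)

print(with_default)  # "88958\nW</th><th>26m</th>"
print(with_dotall)   # "88958"
```

So the leftover "\nW</th><th>26m</th>" in the first result is simply the part of the string that the pattern never matched.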
> Looks to me that a period is being improperly recognized.
>
> On May 6, 2010, at 11:28 AM, Simon Urbanek wrote:
>
>> FWIW I don't think \d is a basic regexp
>
> B) With regard to the default (which I read to be extended rather than basic) vs. perl-like, the Extended section of the regex documentation contains:
>
> " Symbols \d, \s, \D and \S denote the digit and space classes and their negations."
>
Yes, you're right - extended is the default.
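If you want a pattern that does not depend on how \d is handled in a given mode, one option (a suggestion, not a statement about what the regex engine guarantees) is to spell the digit class out as [0-9] or [[:digit:]]. A small sketch with the test2 string from this thread follows; the helper name extract5 is made up for illustration.

```r
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

# Hypothetical helper: replace the whole string by a captured run of five
# digits, using an explicit POSIX class instead of \d. If no five digits
# are found, sub() returns the input unchanged.
extract5 <- function(x, perl = FALSE) {
  sub(".*([[:digit:]]{5}).*", "\\1", x, perl = perl)
}

res_default <- extract5(test2)               # default (extended) mode
res_perl    <- extract5(test2, perl = TRUE)  # perl mode

print(res_default)
print(res_perl)
```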
Cheers,
Simon
>> so as I would expect the perl mode to work and it does:
>>
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
>> [1] "12345"
>>
>> Yet I agree that it should either fail (i.e. return the unmodified string) or return 12345.
>>
>> Also note that the bug is locale-specific:
>>
>> LANG=C R
>>
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
>> [1] "12345"
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "12345"
>>
>> Also note that this is not Mac-specific:
>>
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>>> system("uname -sr")
>> Linux 2.6.32-trunk-amd64
>>> Sys.getlocale("LC_CTYPE")
>> [1] "en_US.UTF-8"
>>
>>
>> Cheers,
>> Simon
>>
>>
>>
>> On May 6, 2010, at 6:54 AM, David Winsemius wrote:
>>
>>>
>>> On May 6, 2010, at 2:21 AM, steven mosher wrote:
>>>
>>>> see below,
>>>>
>>>> using a regex in sub() fails if the pattern is //d{5} and succeeds
>>>> if the pattern [0-9]{5} is used. See the test cases below.
>>>>
>>>> The issue did not occur on a Windows machine; David and I both had it on a Mac.
>>>
>>> Except we both were using \\d rather than //d.
>>>
>>> I believe that Steve is using R 2.11.0 but I am still using R 2.10.1 (but with the release of an Hmisc upgrade I will convert soon.)
>>>
>>> --
>>> David.
>>>
>>>> sessionInfo()
>>> R version 2.10.1 RC (2009-12-09 r50695)
>>> x86_64-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] tcltk stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] gsubfn_0.5-2 proto_0.3-8 zoo_1.6-3 SASxport_1.2.3 lattice_0.18-3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] chron_2.3-35 grid_2.10.1 tools_2.10.1
>>>>
>>>> r11
>>>>
>>>> mac os 10.5
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: steven mosher <moshersteven at gmail.com>
>>>> Date: Wed, May 5, 2010 at 3:25 PM
>>>> Subject: Re: [R] extracting a matched string using regexpr
>>>> To: David Winsemius <dwinsemius at comcast.net>
>>>> Cc: Gabor Grothendieck <ggrothendieck at gmail.com>, r-help <r-help at r-project.org>
>>>>
>>>>
>>>> with a fresh restart
>>>>
>>>>
>>>>
>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>>
>>>>> test
>>>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>> [1] "</th>"
>>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>>> [1] "88958"
>>>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>>>> sub(".*(\\d{5}).*", "\\1", test2)
>>>> [1] "WWWWW"
>>>>>
>>>>> sub(".*(\\d{5}).*", "\\1", test2)
>>>> [1] "WWWWW"
>>>>> sub(".*([0-9]{5}).*", "\\1", test2)
>>>> [1] "12345"
>>>>
>>>>
>>>> Steve.
>>>>
>>>>
>>>>
>>>> On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>>
>>>>>
>>>>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>>>>
>>>>>> Here are two ways to extract 5 digits.
>>>>>>
>>>>>> In the first one \\1 refers to the portion matched between the
>>>>>> parentheses in the regular expression.
>>>>>>
>>>>>> In the second one, strapply is like apply: the object to be worked
>>>>>> on is the first argument (an array for apply, a string for strapply),
>>>>>> the second argument modifies it (which dimension for apply, the regular
>>>>>> expression for strapply), and the last is a function that acts on each
>>>>>> value (typically each row or column for apply and each match for
>>>>>> strapply). In this case we use c as our function to just return all the
>>>>>> results. They are returned in a list with one component per string, but
>>>>>> here test is just a single string, so we get a list of length one and
>>>>>> ask for the contents of the first component using [[1]].
>>>>>>
>>>>>> # 1 - sub
>>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>>>
>>>>>> test
>>>>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>>
>>>>> I get different results than I expected given that "\\d" should be
>>>>> synonymous with "[0-9]":
>>>>>
>>>>>
>>>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>>>> [1] "88958"
>>>>>
>>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>> [1] "</th>"
>>>>>
>>>>> --
>>>>> David.
>>>>>
>>>>>>
>>>>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>>>>> library(gsubfn)
>>>>>> strapply(test, "\\d{5}", c)[[1]]
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <moshersteven at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Given a text like
>>>>>>>
>>>>>>> I want to be able to extract a matched regular expression from a piece of
>>>>>>> text.
>>>>>>>
>>>>>>> this apparently works, but is pretty ugly
>>>>>>> # some html
>>>>>>>
>>>>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>>>> # a pattern to extract 5 digits
>>>>>>>
>>>>>>>> pattern<-"[0-9]{5}"
>>>>>>>>
>>>>>>> # regexpr returns a start point [1] and an attribute "match.length"
>>>>>>> attr(,"match.length")
>>>>>>> # get the substring from the start point to the stop point, where stop = start + length - 1
>>>>>>>
>>>>>>>>
>>>>>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>>>>
>>>>>>>> answer
>>>>>>>>
>>>>>>> [1] "88958"
>>>>>>>
>>>>>>> I tried using sub(pattern, replacement, x) with a regexp that captured
>>>>>>> the group. I'd found an example of this in the mails,
>>>>>>> but it didn't seem to work.
>>>>>>>
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>> David Winsemius, MD
>>>>> West Hartford, CT
>>>>>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> R-SIG-Mac mailing list
>>>> R-SIG-Mac at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>>
>>>
>>
>
> David Winsemius, MD
> West Hartford, CT
>
>