[R-SIG-Mac] Fwd: [R] extracting a matched string using regexpr Possible BUG

Thu May 6 18:10:45 CEST 2010

On May 6, 2010, at 11:50 AM, David Winsemius wrote:

> Two Q's:
> A) Is this supposed to happen with perl-mode?:
> 
> > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> >
> > sub(".*(\\d{5}).*", "\\1", test, perl=TRUE)
> [1] "88958\nW</th><th>26m</th>"
> >
> > sub(".*([0-9]{5}).*", "\\1", test, perl=TRUE)
> [1] "88958\nW</th><th>26m</th>"
> 

It is not a bug: perl mode takes EOL into account, so by default "." does not match a newline and .* is matched only up to the end of the line. For your purposes you want to enable the (?s) option, so you probably meant:

> sub("(?s).*(\\d{5}).*", "\\1", test, perl=TRUE)
[1] "88958"
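
As a small sketch of the difference (the variable names first_line_only and dot_all are just illustrative, not from the thread), the same pattern can be run with and without (?s) on the string above:

```r
# The string from the thread; note the embedded newline after "68.9".
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# Default perl mode: "." does not match "\n", so only the first line
# is matched and replaced; everything from the newline on is left alone.
first_line_only <- sub(".*(\\d{5}).*", "\\1", test, perl = TRUE)

# With (?s) the dot also matches "\n", so the whole string is replaced.
dot_all <- sub("(?s).*(\\d{5}).*", "\\1", test, perl = TRUE)

print(first_line_only)  # "88958\nW</th><th>26m</th>"
print(dot_all)          # "88958"
```

The first result is the same output reported in the quoted question; the second is the one shown just above.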

> Looks to me that a period is being improperly recognized.
> 
> On May 6, 2010, at 11:28 AM, Simon Urbanek wrote:
> 
>> FWIW I don't think \d is a basic regexp
> 
> B) With regard to the default (which I read to be extended rather than basic) vs. perl-like, the Extended section of the regex documentation contains:
> 
> " Symbols \d, \s, \D and \S denote the digit and space classes and their negations."
> 

Yes, you're right - extended is the default.
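
A pattern that avoids \d altogether sidesteps the question of which mode supports it. The following is only a sketch, assuming a reasonably current R: the thread reports that [0-9] works in the default mode on the affected versions, but whether [[:digit:]] also does on those versions is an untested assumption. The names patterns, extended_res and perl_res are illustrative.

```r
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

# Bracket expressions are understood by both the default (extended)
# and the perl-like mode, unlike the \d shorthand.
patterns <- c(".*([0-9]{5}).*", ".*([[:digit:]]{5}).*")

# Apply each pattern in the default mode ...
extended_res <- vapply(patterns, function(p) sub(p, "\\1", test2),
                       character(1), USE.NAMES = FALSE)

# ... and again in perl mode.
perl_res <- vapply(patterns, function(p) sub(p, "\\1", test2, perl = TRUE),
                   character(1), USE.NAMES = FALSE)

print(extended_res)
print(perl_res)
```

On a current R both vectors should contain only "12345".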

Cheers,
Simon

>> so I would expect the perl mode to work, and it does:
>> 
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
>> [1] "12345"
>> 
>> Yet I agree that it should either fail (i.e. return the unmodified string) or return 12345.
>> 
>> Also note that the bug is locale-specific:
>> 
>> LANG=C R
>> 
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
>> [1] "12345"
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "12345"
>> 
>> Also note that this is not Mac-specific:
>> 
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>>> system("uname -sr")
>> Linux 2.6.32-trunk-amd64
>>> Sys.getlocale("LC_CTYPE")
>> [1] "en_US.UTF-8"
>> 
>> 
>> Cheers,
>> Simon
>> 
>> 
>> 
>> On May 6, 2010, at 6:54 AM, David Winsemius wrote:
>> 
>>> 
>>> On May 6, 2010, at 2:21 AM, steven mosher wrote:
>>> 
>>>> see below,
>>>> 
>>>> using a regex in sub() fails if the pattern is //d{5} and succeeds
>>>> if the pattern [0-9]{5} is used; see the test cases below.
>>>> 
>>>> The issue did not occur on a Windows machine; David and I both had it on a Mac.
>>> 
>>> Except we both were using \\d rather than //d.
>>> 
>>> I believe that Steve is using R 2.11.0 but I am still using R 2.10.1 (but with the release of an Hmisc upgrade I will convert soon.)
>>> 
>>> -- 
>>> David.
>>> 
>>>> sessionInfo()
>>> R version 2.10.1 RC (2009-12-09 r50695)
>>> x86_64-apple-darwin9.8.0
>>> 
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>> 
>>> attached base packages:
>>> [1] tcltk     stats     graphics  grDevices utils     datasets  methods   base
>>> 
>>> other attached packages:
>>> [1] gsubfn_0.5-2   proto_0.3-8    zoo_1.6-3      SASxport_1.2.3 lattice_0.18-3
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] chron_2.3-35 grid_2.10.1  tools_2.10.1
>>>> 
>>>> r11
>>>> 
>>>> mac os 10.5
>>>> 
>>>> ---------- Forwarded message ----------
>>>> From: steven mosher <moshersteven at gmail.com>
>>>> Date: Wed, May 5, 2010 at 3:25 PM
>>>> Subject: Re: [R] extracting a matched string using regexpr
>>>> To: David Winsemius <dwinsemius at comcast.net>
>>>> Cc: Gabor Grothendieck <ggrothendieck at gmail.com>, r-help <r-help at r-project.org>
>>>> 
>>>> 
>>>> with a fresh restart
>>>> 
>>>> 
>>>> 
>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>> 
>>>>> test
>>>> [1]
>>>> "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>> [1] "</th>"
>>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>>> [1] "88958"
>>>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>>>> sub(".*(\\d{5}).*", "\\1", test2)
>>>> [1] "WWWWW"
>>>>> 
>>>>> sub(".*(\\d{5}).*", "\\1", test2)
>>>> [1] "WWWWW"
>>>>> sub(".*([0-9]{5}).*", "\\1", test2)
>>>> [1] "12345"
>>>> 
>>>> 
>>>> Steve.
>>>> 
>>>> 
>>>> 
>>>> On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>> 
>>>>> 
>>>>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>>>> 
>>>>> Here are two ways to extract 5 digits.
>>>>>> 
>>>>>> In the first one \\1 refers to the portion matched between the
>>>>>> parentheses in the regular expression.
>>>>>> 
>>>>>> In the second one strapply is like apply where the object to be worked
>>>>>> on is the first argument (array for apply, string for strapply) the
>>>>>> second modifies it (which dimension for apply, regular expression for
>>>>>> strapply) and the last is a function which acts on each value
>>>>>> (typically each row or column for apply and each match for strapply).
>>>>>> In this case we use c as our function to just return all the results.
>>>>>> They are returned in a list with one component per string but here
>>>>>> test is just a single string so we get a list one long and we ask for
>>>>>> the contents of the first component using [[1]].
>>>>>> 
>>>>>> # 1 - sub
>>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>>> 
>>>>>> test
>>>>> [1]
>>>>> "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>> 
>>>>> I get different results than I expected given that "\\d" should be
>>>>> synonymous with "[0-9]":
>>>>> 
>>>>> 
>>>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>>>> [1] "88958"
>>>>> 
>>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>> [1] "</th>"
>>>>> 
>>>>> --
>>>>> David.
>>>>> 
>>>>>> 
>>>>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>>>>> library(gsubfn)
>>>>>> strapply(test, "\\d{5}", c)[[1]]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <moshersteven at gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Given a text like
>>>>>>> 
>>>>>>> I want to be able to extract a matched regular expression from a piece of
>>>>>>> text.
>>>>>>> 
>>>>>>> this apparently works, but is pretty ugly
>>>>>>> # some html
>>>>>>> 
>>>>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>>>> # a pattern to extract 5 digits
>>>>>>> 
>>>>>>>> pattern<-"[0-9]{5}"
>>>>>>>> 
>>>>>>> # regexpr returns a start point [1] and an attribute "match.length", i.e. attr(,"match.length")
>>>>>>> # get the substring from the start point to the stop point, where stop = start + length - 1
>>>>>>> 
>>>>>>>> 
>>>>>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>>>> 
>>>>>>>> answer
>>>>>>>> 
>>>>>>> [1] "88958"
>>>>>>> 
>>>>>>> I tried using sub(pattern, replacement, x) with a regexp that captured the
>>>>>>> group. I'd found an example of this in the mails, but it didn't seem to work.
>>>>>>> 
>>>>>> 
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>> 
>>>>> 
>>>>> David Winsemius, MD
>>>>> West Hartford, CT
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> R-SIG-Mac mailing list
>>>> R-SIG-Mac at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
>