[Rd] pb in regular expression with the character "-" (PR#9437)

ripley at stats.ox.ac.uk
Fri Jan 5 00:06:00 CET 2007


Both Solaris 8 grep and GNU grep 2.5.1 give

gannet% cat > foo.txt
a-a
b
gannet% egrep '[d|-|c]' foo.txt
gannet% egrep '[-|c]' foo.txt
a-a

agreeing exactly with R (and the POSIX standard) and contradicting 'Fan'.
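
The two rules at work here (a '-' between two characters forms a range, while a '-' placed first or last in a bracket expression is literal) can be checked with a short sketch. It assumes a POSIX shell and a grep that accepts -E. count_matches is a hypothetical helper introduced only for this illustration, and the third pattern is an added variant that does not appear in the thread.

```shell
# Hypothetical helper: count how many of the two test lines an ERE matches.
# grep -c exits with status 1 when the count is 0, hence the '|| true'.
count_matches() {
  printf 'a-a\nb\n' | grep -E -c -- "$1" || true
}

count_matches '[d|-|c]'   # '|-|' is a range containing only '|'  -> prints 0
count_matches '[-|c]'     # '-' comes first, so it is literal     -> prints 1
count_matches '[d|c-]'    # '-' comes last, so it is literal      -> prints 1
```

Reordering the class so that '-' comes first or last, as in the second and third calls, is the portable way to test for a literal hyphen alongside other characters.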


On Thu, 4 Jan 2007, Fan wrote:

> Let me detail a bit my bug report:
>
> the two commands ("expected" vs "strange") should return the
> same result, the objective of the commands is to test the presence
> of several characters, '-' included.
>
> The order in which we specify the different characters must not be
> an issue, i.e., to test the presence of several characters, including
> say char_1, the regular expressions [char_1|char_2|char_3] and 
> [char_2|char_1|char_3] should play the same role. Other software
> works just like this.
>
> What's reported is that R actually returns a different result for the
> character "-" (\- in a RE) depending on its position in the regular
> expression, and the "perl" option would not be relevant.

That behaviour is exactly what the relevant international standard (POSIX)
and R's own documentation (?regexp) describe.

> Prof Brian Ripley wrote:
>> Why do you think this is a bug in R?  You have not told us what you 
>> expected, but the character range |-| contains only | .  Not agreeing with 
>> your expectations (unstated or otherwise) is not a bug in R.
>> 
>> \- is the same as -, and - is special in character classes.  (If it is 
>> first or last it is treated literally.)  And | is not a metacharacter 
>> inside a character class.  Also,
>> 
>>> grep("[d\\-c]", c("a-a","b"))
>> 
>> [1] 1 2
>> 
>>> grep("[d\\-c]", c("a-a","b"), perl=TRUE)
>> 
>> [1] 1
>> 
>> shows that escaping - works only in perl (which you will find from the 
>> background references mentioned, e.g.
>>
>>   The interpretation of an ordinary character preceded by a backslash
>>   ('\') is undefined.
>> 
>> .)
>> 
>> This is all carefully documented in ?regexp, e.g.
>>
>>      Patterns are described here as they would be printed by 'cat': do
>>      remember that backslashes need to be doubled in entering R
>>      character strings from the keyboard.
>> 
>> 
>> This is not the first time you have wasted our resources with false bug 
>> reports, so please show more respect for the R developers' time.
>> You were also explicitly asked not to report on obsolete versions of R.
>> 
>> On Wed, 3 Jan 2007, xiao.gang.fan1 at libertysurf.fr wrote:
>> 
>>> Full_Name: FAN
>>> Version: 2.4.0
>>> OS: Windows
>>> Submission from: (NULL) (159.50.101.9)
>>> 
>>> 
>>> These are expected:
>>> 
>>>> grep("[\-|c]", c("a-a","b"))
>>> 
>>> [1] 1
>>> 
>>>> gsub("[\-|c]", "&", c("a-a","b"))
>>> 
>>> [1] "a&a" "b"
>>> 
>>> but these are strange:
>>> 
>>>> grep("[d|\-|c]", c("a-a","b"))
>>> 
>>> integer(0)
>>> 
>>>> gsub("[d|\-|c]", "&", c("a-a","b"))
>>> 
>>> [1] "a-a" "b"
>>> 
>>> Thanks

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


