[Rd] (PR#9437) pb in regular expression with the character "-"

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Jan 8 13:34:00 CET 2007


On Mon, 8 Jan 2007, Hin-Tak Leung wrote:

> May I chip in at this point - I agree the bug report was invalid, but
> many of the replies were missing the point, as far as I see. It
> wasn't the backslash escape that "Fan" is *mainly* confused about
> (which he obviously is...), but the uses of the different brackets:
> [], ().
>
> He/She was expecting this:
>         egrep '[a|\-|c]' foo.txt
> to work the same as:
>         egrep '(a|\-|c)'  foo.txt
>
> which they do not. They are totally different. (and he doesn't know
> the proper use of "|" either... so we basically have established that
> "Fan" doesn't understand how \, |, [] and () are used in
> regular expressions...).

And I think *you* have missed the point of the posting you quoted here, 
which was to show how '[a|-|c]' actually worked in 'other software'.

How '[a|\-|c]' (or '(a|\-|c)') works in egrep is explicitly undefined by
POSIX, as I said in my original reply.
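
For anyone who wants to repeat the comparison, here is a minimal sketch
restricted to the cases POSIX does define (no backslash before '-'). It
assumes a POSIX-conforming grep, with 'grep -E' standing in for egrep; the
temporary file and the variable names r1, r2, r3 are purely illustrative
and not part of the original session.

```shell
# Illustrative sketch: tmp, r1, r2 and r3 are hypothetical names.
LC_ALL=C
export LC_ALL

tmp=$(mktemp)
printf 'a-a\nb\n' > "$tmp"

# Inside [], '|-|' is a range from '|' to '|', so the class is {d, |, c}.
# Neither input line contains any of those, so nothing is printed.
r1=$(grep -E '[d|-|c]' "$tmp" || true)

# A '-' placed first (or last) in the class is literal: the class is {-, |, c}.
r2=$(grep -E '[-|c]' "$tmp" || true)

# Alternation belongs outside brackets, where '-' is an ordinary character.
r3=$(grep -E '(d|-|c)' "$tmp" || true)

printf '[d|-|c] -> "%s"\n' "$r1"
printf '[-|c]   -> "%s"\n' "$r2"
printf '(d|-|c) -> "%s"\n' "$r3"

rm -f "$tmp"
```

The first two patterns are the ones from the session quoted below; the
third is only there to show what the bracketed form is presumably being
mistaken for.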

>
> HTL
>
> ripley at stats.ox.ac.uk wrote:
>> Both Solaris 8 grep and GNU grep 2.5.1 give
>>
>> gannet% cat > foo.txt
>> a-a
>> b
>> gannet% egrep '[d|-|c]' foo.txt
>> gannet% egrep '[-|c]' foo.txt
>> a-a
>>
>> agreeing exactly with R (and the POSIX standard) and contradicting 'Fan'.
>>
>>
>> On Thu, 4 Jan 2007, Fan wrote:
>>
>>> Let me detail a bit my bug report:
>>>
>>> the two commands ("expected" vs "strange") should return the
>>> same result, the objective of the commands is to test the presence
>>> of several characters, '-' included.
>>>
>>> The order in which we specify the different characters must not be
>>> an issue, i.e., to test the presence of several characters, including
>>> say char_1, the regular expressions [char_1|char_2|char_3] and
>>> [char_2|char_1|char_3] should play the same role. Other software
>>> works just like this.
>>>
>>> What's reported is that R actually returns a different result for the
>>> character "-" (\- in a RE) depending on its position in the regular
>>> expression, and the "perl" option would not be relevant.
>>
>> As described in the relevant international standard and R's own
>> documentation.
>>
>>> Prof Brian Ripley wrote:
>>>> Why do you think this is a bug in R?  You have not told us what you
>>>> expected, but the character range |-| contains only | .  Not agreeing with
>>>> your expectations (unstated or otherwise) is not a bug in R.
>>>>
>>>> \- is the same as -, and - is special in character classes.  (If it is
>>>> first or last it is treated literally.)  And | is not a metacharacter
>>>> inside a character class.  Also,
>>>>
>>>>> grep("[d\\-c]", c("a-a","b"))
>>>> [1] 1 2
>>>>
>>>>> grep("[d\\-c]", c("a-a","b"), perl=TRUE)
>>>> [1] 1
>>>>
>>>> shows that escaping - works only in perl (which you will find from the
>>>> background references mentioned, e.g.
>>>>
>>>>   The interpretation of an ordinary character preceded by a backslash
>>>>   ('\') is undefined.
>>>>
>>>> .)
>>>>
>>>> This is all carefully documented in ?regexp, e.g.
>>>>
>>>>      Patterns are described here as they would be printed by 'cat': do
>>>>      remember that backslashes need to be doubled in entering R
>>>>      character strings from the keyboard.
>>>>
>>>>
>>>> This is not the first time you have wasted our resources with false bug
>>>> reports, so please show more respect for the R developers' time.
>>>> You were also explicitly asked not to report on obsolete versions of R.
>>>>
>>>> On Wed, 3 Jan 2007, xiao.gang.fan1 at libertysurf.fr wrote:
>>>>
>>>>> Full_Name: FAN
>>>>> Version: 2.4.0
>>>>> OS: Windows
>>>>> Submission from: (NULL) (159.50.101.9)
>>>>>
>>>>>
>>>>> These are expected:
>>>>>
>>>>>> grep("[\-|c]", c("a-a","b"))
>>>>> [1] 1
>>>>>
>>>>>> gsub("[\-|c]", "&", c("a-a","b"))
>>>>> [1] "a&a" "b"
>>>>>
>>>>> but these are strange:
>>>>>
>>>>>> grep("[d|\-|c]", c("a-a","b"))
>>>>> integer(0)
>>>>>
>>>>>> gsub("[d|\-|c]", "&", c("a-a","b"))
>>>>> [1] "a-a" "b"
>>>>>
>>>>> Thanks
>>>>>
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>
>>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


