[Rd] grep with fixed=TRUE and ignore.case=TRUE

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon May 14 10:42:44 CEST 2007


On Fri, 11 May 2007, Petr Savicky wrote:

> On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
>> I suggest you collaborate with the person who replied that he thought this
>> was a good idea to supply patches against the R-devel sources for
>> scrutiny.
>
> A possible solution is to use strncasecmp instead of strncmp
> in function fgrep_one in R-devel/src/main/character.c.
>
> Corresponding modification of character.c is at
>  http://www.cs.cas.cz/~savicky/ignore_case/character.c
> and diff file w.r.t. the original character.c (downloaded today) is at
>  http://www.cs.cas.cz/~savicky/ignore_case/diff.txt
>
> This seems to work in my installation of R-devel:
>
>  > x <- c("D.G cat", "d.g cat", "dog cat")
>  > z <- "d.g"
>  > grep(z, x, ignore.case = F, fixed = T)
>  [1] 2
>  > grep(z, x, ignore.case = T, fixed = T)  # this is the new behavior
>  [1] 1 2
>  > grep(z, x, ignore.case = T, fixed = F)
>  [1] 1 2 3
>  >
>
> Since fgrep_one is used many times in character.c, adding igcase_opt as
> an additional argument would imply extensive changes to the file.
> So, I introduced a new function fgrep_one_igcase called only once in
> the file. Another solution is possible.
>
> I do not understand the handling of multibyte chars well, so I did not
> test the function with real multibyte chars, although the code for
> this option is used.

Thanks for looking into this.

strncasecmp is not standard C (not even C99), but R does have a substitute 
for it.  Unfortunately strncasecmp is not usable with multibyte charsets: 
Linux systems have wcsncasecmp but that is not portable.  In these days of 
widespread use of UTF-8 that is a blocking issue, I am afraid.

In the case of grep I think all you need is

grep(tolower(pattern), tolower(x), fixed = TRUE)

and similarly for regexpr.
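
As a minimal sketch of the above, reusing the x and z from the quoted
session (the variable names hits and pos are only for illustration), both
sides are lower-cased before a fixed match, so "." stays a literal
character and the indices and positions still refer to the original x:

```r
x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"

# Lower-case both sides, then match literally (no regex interpretation of ".").
hits <- grep(tolower(z), tolower(x), fixed = TRUE)
print(hits)             # 1 2

# Same idea for regexpr: position of the first match, -1 where there is none.
pos <- regexpr(tolower(z), tolower(x), fixed = TRUE)
print(as.integer(pos))  # 1 1 -1
```

This relies on tolower() respecting the current locale's character set,
which is exactly the part that a bytewise strncasecmp does not handle.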

> The ignore-case option is not meaningful in gsub.

sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)

is different from 'ignore.case=FALSE', and I see the meaning as clear.
So what did you mean?  (Unfortunately the tolower trick does not work for 
[g]sub.)
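
To illustrate both points with the same input (a sketch only; the names
with_ic, without_ic and lowered are made up for this example): the two
settings of ignore.case give different results, and lower-casing the input
first also lower-cases the text that was not matched, which is why the
trick is not a substitute for [g]sub.

```r
x <- c("ABCD", "abcd")

# Case-insensitive regex substitution: both elements are changed,
# and the unmatched "D" in the first element keeps its case.
with_ic <- sub("abc", "123", x, ignore.case = TRUE)
print(with_ic)      # "123D" "123d"

# Case-sensitive: only the second element matches.
without_ic <- sub("abc", "123", x, ignore.case = FALSE)
print(without_ic)   # "ABCD" "123d"

# The tolower trick applied to sub: the match is found, but the result is
# built from the lower-cased input, so the original "D" is lost.
lowered <- sub(tolower("abc"), "123", tolower(x), fixed = TRUE)
print(lowered)      # "123d" "123d"
```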

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


