[Rd] grep with fixed=TRUE and ignore.case=TRUE
Prof Brian Ripley
ripley at stats.ox.ac.uk
Mon May 14 10:42:44 CEST 2007

On Fri, 11 May 2007, Petr Savicky wrote:
> On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
>> I suggest you collaborate with the person who replied that he thought this
>> was a good idea to supply patches against the R-devel sources for
>> scrutiny.
>
> A possible solution is to use strncasecmp instead of strncmp
> in function fgrep_one in R-devel/src/main/character.c.
>
> Corresponding modification of character.c is at
>  http://www.cs.cas.cz/~savicky/ignore_case/character.c
> and diff file w.r.t. the original character.c (downloaded today) is at
>  http://www.cs.cas.cz/~savicky/ignore_case/diff.txt
>
> This seems to work in my installation of R-devel:
>
>  > x <- c("D.G cat", "d.g cat", "dog cat")
>  > z <- "d.g"
>  > grep(z, x, ignore.case = FALSE, fixed = TRUE)
>  [1] 2
>  > grep(z, x, ignore.case = TRUE, fixed = TRUE)  # this is the new behavior
>  [1] 1 2
>  > grep(z, x, ignore.case = TRUE, fixed = FALSE)
>  [1] 1 2 3
>  >
>
> Since fgrep_one is used many times in character.c, adding igcase_opt as
> an additional argument would imply extensive changes to the file.
> So, I introduced a new function fgrep_one_igcase called only once in
> the file. Another solution is possible.
>
> I do not understand well handling multibyte chars, so I did not test
> the function with real multibyte chars, although the code for
> this option is used.

Thanks for looking into this.

strncasecmp is not standard C (not even C99), but R does have a substitute
for it.  Unfortunately, strncasecmp is not usable with multibyte charsets:
Linux systems have wcsncasecmp, but that is not portable.  In these days of
widespread use of UTF-8 that is a blocking issue, I am afraid.
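
To make the multibyte point concrete, here is a rough sketch in R rather than C. The helper ascii_fold is hypothetical: it only stands in for a byte-wise comparison that folds ASCII letters, and it is not what any particular strncasecmp does (that depends on the C library and the locale). It shows that an accented letter in UTF-8 is two bytes, and that folding ASCII bytes alone cannot make its upper- and lower-case forms compare equal.

```r
# Hypothetical stand-in for byte-wise, ASCII-only case folding.
ascii_fold <- function(bytes) {
  b <- as.integer(bytes)
  is_upper <- b >= 0x41 & b <= 0x5A        # bytes for 'A'..'Z'
  b[is_upper] <- b[is_upper] + 32          # map to 'a'..'z'
  as.raw(b)
}

upper <- "\u00c9"   # LATIN CAPITAL LETTER E WITH ACUTE, stored as UTF-8
lower <- "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, stored as UTF-8

# One character, but two bytes each.
n_chars <- nchar(upper, type = "chars")
n_bytes <- nchar(upper, type = "bytes")

# ASCII text folds as expected ...
ascii_ok <- identical(ascii_fold(charToRaw("D.G")), charToRaw("d.g"))

# ... but the two-byte sequences c3 89 and c3 a9 are left untouched,
# so the folded byte strings still differ.
utf8_ok <- identical(ascii_fold(charToRaw(upper)), charToRaw(lower))
```

Under these assumptions ascii_ok is TRUE and utf8_ok is FALSE, which is the kind of failure a byte-oriented comparison risks on UTF-8 input.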

In the case of grep I think all you need is

    grep(tolower(pattern), tolower(x), fixed = TRUE)

and similarly for regexpr.
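
As a minimal sketch of that suggestion, applied to the vectors from the quoted session (the variable names hits_exact, hits_nocase and pos_nocase are only illustrative):

```r
x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"

# Literal, case-sensitive match: only the element containing "d.g".
hits_exact <- grep(z, x, fixed = TRUE)

# Literal, case-insensitive match by lower-casing both sides.
# "dog cat" is not matched because "." is not a wildcard when fixed = TRUE.
hits_nocase <- grep(tolower(z), tolower(x), fixed = TRUE)

# Same idea for regexpr(). The positions refer to tolower(x); this sketch
# assumes they line up with positions in x, which holds for ASCII strings.
pos_nocase <- regexpr(tolower(z), tolower(x), fixed = TRUE)
```

Here hits_exact is 2 and hits_nocase is 1 2, the same result as the patched grep in the quoted session, and pos_nocase gives match positions 1, 1 and -1.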

> Ignore case option is not meaningful in gsub.

    sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)

is different from 'ignore.case=FALSE', and I see the meaning as clear.
So what did you mean?  (Unfortunately the tolower trick does not work for
[g]sub.)

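A small sketch of both points, using the same input (the names res_ignore, res_default and res_lowered are only illustrative). It compares sub with and without ignore.case, and then shows one way to see why lower-casing the input is not a substitute: the text that was not matched comes back lower-cased as well.

```r
x <- c("ABCD", "abcd")

# Regular-expression matching, ignoring case: both elements are changed,
# and the unmatched "D" keeps its original case.
res_ignore <- sub("abc", "123", x, ignore.case = TRUE)

# Case-sensitive matching: only the second element is changed.
res_default <- sub("abc", "123", x)

# The tolower trick: the match is found in both elements, but the
# unmatched "D" in the first element has been lower-cased too.
res_lowered <- sub("abc", "123", tolower(x), fixed = TRUE)
```

The results are "123D" "123d" for res_ignore, "ABCD" "123d" for res_default, and "123d" "123d" for res_lowered, so the last form does not reproduce the first.
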
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
    
    