[Rd] grep with fixed=TRUE and ignore.case=TRUE
Prof Brian Ripley
ripley at stats.ox.ac.uk
Mon May 14 10:42:44 CEST 2007
On Fri, 11 May 2007, Petr Savicky wrote:
> On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
>> I suggest you collaborate with the person who replied that he thought this
>> was a good idea to supply patches against the R-devel sources for
>> scrutiny.
>
> A possible solution is to use strncasecmp instead of strncmp
> in function fgrep_one in R-devel/src/main/character.c.
>
> Corresponding modification of character.c is at
> http://www.cs.cas.cz/~savicky/ignore_case/character.c
> and diff file w.r.t. the original character.c (downloaded today) is at
> http://www.cs.cas.cz/~savicky/ignore_case/diff.txt
>
> This seems to work in my installation of R-devel:
>
> > x <- c("D.G cat", "d.g cat", "dog cat")
> > z <- "d.g"
> > grep(z, x, ignore.case = FALSE, fixed = TRUE)
> [1] 2
> > grep(z, x, ignore.case = TRUE, fixed = TRUE) # this is the new behavior
> [1] 1 2
> > grep(z, x, ignore.case = TRUE, fixed = FALSE)
> [1] 1 2 3
> >
>
> Since fgrep_one is used many times in character.c, adding igcase_opt as
> an additional argument would imply extensive changes to the file.
> So, I introduced a new function fgrep_one_igcase called only once in
> the file. Another solution is possible.
>
> I do not understand the handling of multibyte chars well, so I did not
> test the function with real multibyte chars, although the code for
> this option is used.
Thanks for looking into this.
strncasecmp is not standard C (not even C99), but R does have a substitute
for it. Unfortunately strncasecmp is not usable with multibyte charsets:
Linux systems have wcsncasecmp but that is not portable. In these days of
widespread use of UTF-8 that is a blocking issue, I am afraid.
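
To make the multibyte point concrete, here is a small R sketch (an illustration only, not part of the proposed patch). It assumes the strings are UTF-8 encoded, which the \u escapes should ensure. It shows that an accented letter is one character but two bytes, and that its upper- and lower-case forms differ in the second byte, so a comparison that folds case one byte at a time cannot treat them as equal.

```r
upper <- "\u00c9"   # LATIN CAPITAL LETTER E WITH ACUTE
lower <- "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# One character, but two bytes in UTF-8.
n_chars <- nchar(upper, type = "chars")
n_bytes <- nchar(upper, type = "bytes")

# Raw UTF-8 bytes of each string, as integers.
upper_bytes <- as.integer(charToRaw(upper))
lower_bytes <- as.integer(charToRaw(lower))

# Byte positions at which the two encodings differ.
differing <- which(upper_bytes != lower_bytes)
```

The differing byte lies outside the ASCII range, which is where byte-wise case folding of the strncasecmp kind stops being useful.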
In the case of grep I think all you need is

    grep(tolower(pattern), tolower(x), fixed = TRUE)

and similarly for regexpr.
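
The tolower workaround can be wrapped up as in the following minimal sketch. The helper names grep_fixed_ic and regexpr_fixed_ic are hypothetical and not part of R; the test vector is the one from the quoted session.

```r
# Hypothetical helpers (not part of base R): case-insensitive literal
# matching by lower-casing both the pattern and the subject strings.
grep_fixed_ic <- function(pattern, x) {
  grep(tolower(pattern), tolower(x), fixed = TRUE)
}
regexpr_fixed_ic <- function(pattern, x) {
  regexpr(tolower(pattern), tolower(x), fixed = TRUE)
}

x <- c("D.G cat", "d.g cat", "dog cat")

# The "." is literal here, so "dog cat" does not match.
hits <- grep_fixed_ic("d.g", x)

# Start position of the first match in each element, -1 for no match.
pos <- regexpr_fixed_ic("D.G", x)
```

Note that the positions returned by the regexpr variant refer to the lower-cased strings, so the sketch assumes that tolower leaves the number of characters unchanged.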
> Ignore case option is not meaningful in gsub.

    sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)

is different from 'ignore.case=FALSE', and I see the meaning as clear.
So what did you mean? (Unfortunately the tolower trick does not work for
[g]sub.)
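
A short sketch of both points, using the example call from this message. The variable names are illustrative only. It compares sub with and without ignore.case, and then shows what goes wrong if the subject is lower-cased before a fixed substitution: the text outside the match loses its original case as well.

```r
x <- c("ABCD", "abcd")

# Regular-expression substitution with and without case folding.
with_ic    <- sub("abc", "123", x, ignore.case = TRUE)
without_ic <- sub("abc", "123", x, ignore.case = FALSE)

# Lower-casing the subject first makes the fixed match succeed,
# but the trailing "D" of the first element is lower-cased too.
lowered <- sub("abc", "123", tolower(x), fixed = TRUE)
```

With ignore.case = TRUE the first element becomes "123D", whereas the lower-casing approach yields "123d", which is why the trick is fine for locating matches but not for replacing them.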
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595