[Rd] grep with fixed=TRUE and ignore.case=TRUE

Thu May 17 11:28:55 CEST 2007

On Thu, 17 May 2007, Petr Savicky wrote:

>> strncasecmp is not standard C (not even C99), but R does have a substitute
>> for it.  Unfortunately strncasecmp is not usable with multibyte charsets:
>> Linux systems have wcsncasecmp but that is not portable.  In these days of
>> widespread use of UTF-8 that is a blocking issue, I am afraid.
>
> What could help are the functions mbrtowc and towctrans and simple
> long integer comparison. Are the functions mbrtowc and towctrans
> available under Windows? mbrtowc seems to be available as Rmbrtowc
> in src/gnuwin32/extra.c.
>
> I did not find towctrans defined in R sources, but it is in 
> gnuwin32/Rdll.hide

I don't see it in Rdll.hide.  It is a C99 function (see your unix man 
page).

> and used in do_tolower. Does this mean that tolower is not usable
> with utf-8 under Windows?

UTF-8 is not usable under Windows, but tolower works in Windows DBCS (in 
so far as that makes sense: Chinese chars do not have 'case').

Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that is 
not currently active.

>> In the case of grep I think all you need is
>>
>> grep(tolower(pattern), tolower(x), fixed = TRUE)
>>
>> and similarly for regexpr.
>
> Yes. this is correct, but it has disadvantages. It needs more
> space and, if value=TRUE, we would have to do something like
>   x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
> This is hard to implement in src/library/base/R/grep.R,
> where the call to .Internal(grep(pattern,...)) is the last command
> and I think this should be preserved.
>
>>> Ignore case option is not meaningfull in gsub.
>>
>> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
>>
>> is different from 'ignore.case=FALSE', and I see the meaning as clear.
>> So what did you mean?  (Unfortunately the tolower trick does not work for
>> [g]sub.)
>
> The meaning of ignore.case in [g]sub is problematic due to the following.
>  sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
> produces
>  [1] "xyzD" "xyzd"
> but the user may in fact need the following
>  [1] "XYZD" "xyzd"

He may, but that is not what 'ignore case' means, more like 'case 
honouring'.

> It is correct that "xyzD" "xyzd" is produced, but the user
> should be aware of the fact that several substitutions like
>  x <- sub("abc", "xyz", c("ABCD", "abcd"))   # ignore.case=FALSE
>  sub("ABC", "XYZ", x)  # ignore.case=FALSE
> may be more useful.
>
> I have another question concerning the speed of grep. I expected that
> fgrep_one function is slower than calling a library routine
> for regular expressions. In particular, if the pattern has a lot of
> long partial matches in the target string, I expected that it may be much
> slower. A short example is
>  y <- "aaaaaaaaab"
>  x <- "aaaaaaaaaaaaaaaaaaab"
>  grep(y,x)
> which requires 110 comparisons (10 comparisons for each of 11 possible
> beginnings of y in x). In general, the complexity in the worst case is
> O(m*n), where m,n are the lengths of y,x resp. I would expect that
> the library function for matching regular expressions needs
> time O(m+n) and so will be faster. However, the result obtained
> on a larger example is
>
>  > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
>  > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
>  > y <- paste(c(rep("a", times = 10000), x2), collapse = "")
>  > z <- rep(y, times = 100)
>
>  > system.time(i <- grep(x1, z, fixed = T))
>  [1] 1.970 0.000 1.985 0.000 0.000
>
>  > system.time(i <- grep(x1, z, fixed = F))   # reg. expr. surprisingly slow (*)
>  [1] 40.374  0.003 40.381  0.000  0.000
>
>  > system.time(i <- grep(x2, z, fixed = T))
>  [1] 0.113 0.000 0.113 0.000 0.000
>
>  > system.time(i <- grep(x2, z, fixed = F))  # reg. expr. faster than fgrep_one
>  [1] 0.019 0.000 0.019 0.000 0.000
>
> Do you have an explanation of these results, in particular (*)?

Yes, there is a comment on the help page to that effect.  But these are 
highly atypical uses. Try perl=TRUE, and be aware that the locale matters 
a lot in such tests (via the charset).

No one is attempting to make R a fast string-processing language and so 
developers resources are spent on performance where it matters to more 
typical usage.  (E.g. reducing duplication in as.double and friends speeds 
up just about every R session, and speeds up some numerical sessions 
dramatically.)

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595