[Rd] Errors on Windows with grep(fixed=TRUE) on UTF-8 strings
Winston Chang
winstonchang1 at gmail.com
Wed Mar 4 03:12:54 CET 2015
After a bit more investigation, I think I've found the cause of the bug,
and I have a patch.
This bug happens with grep(), when:
* Running on Windows.
* The search uses fixed=TRUE.
* The search pattern is a single byte.
* The current locale has a multibyte encoding.
=======================
Here's an example that demonstrates the bug:
# First, create a 3-byte UTF-8 character
y <- rawToChar(as.raw(c(0xe6, 0xb8, 0x97)))
Encoding(y) <- "UTF-8"
y
# [1] "渗"
# In my default locale, grep with a single-char pattern and fixed=TRUE
# returns integer(0), as expected.
Sys.getlocale("LC_CTYPE")
# [1] "English_United States.1252"
grep("a", y, fixed = TRUE)
# integer(0)
# When the using a multibyte locale, grep with a single-char
# pattern and fixed=TRUE results in an error.
Sys.setlocale("LC_CTYPE", "chinese")
grep("a", y, fixed = TRUE)
# Error in grep("a", y, fixed = TRUE) : invalid multibyte string at '<97>'
=======================
I believe the problem is in the main/grep.c file, in the fgrep_one
function. It tests for a multi-byte character string locale
`mbcslocale`, and then for the `use_UTF8`, like so:
if (!useBytes && mbcslocale) {
...
} else if (!useBytes && use_UTF8) {
...
} else ...
This can be seen at
https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L668-L692
A similar pattern occurs in the fgrep_one_bytes function, at
https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L718-L736
I believe that the test order should be reversed; it should test first
for `use_UTF8`, and then for `mbcslocale`. This pattern occurs in a few
places in grep.c. It looks like this:
if (!useBytes && use_UTF8) {
...
} else if (!useBytes && mbcslocale) {
...
} else ...
=======================
This patch does what I described; it simply tests for `use_UTF8` first,
and then `mbcslocale`, in both fgrep_one and fgrep_one_bytes. I made
this patch against the 3.1.2 sources, and tested the example code above.
In both cases, grep() returned integer(0), as expected.
(The reason I made this change against 3.1.2 is because I had problems
getting the current trunk to compile on both Linux or Windows.)
diff --git src/main/grep.c src/main/grep.c
index 6e6ec3e..348c63d 100644
--- src/main/grep.c
+++ src/main/grep.c
@@ -664,27 +664,27 @@ static int fgrep_one(const char *pat, const char *target,
}
return -1;
}
- if (!useBytes && mbcslocale) { /* skip along by chars */
- mbstate_t mb_st;
+ if (!useBytes && use_UTF8) {
int ib, used;
- mbs_init(&mb_st);
for (ib = 0, i = 0; ib <= len-plen; i++) {
if (strncmp(pat, target+ib, plen) == 0) {
if (next != NULL) *next = ib + plen;
return i;
}
- used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
+ used = utf8clen(target[ib]);
if (used <= 0) break;
ib += used;
}
- } else if (!useBytes && use_UTF8) {
+ } else if (!useBytes && mbcslocale) { /* skip along by chars */
+ mbstate_t mb_st;
int ib, used;
+ mbs_init(&mb_st);
for (ib = 0, i = 0; ib <= len-plen; i++) {
if (strncmp(pat, target+ib, plen) == 0) {
if (next != NULL) *next = ib + plen;
return i;
}
- used = utf8clen(target[ib]);
+ used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
if (used <= 0) break;
ib += used;
}
@@ -714,21 +714,21 @@ static int fgrep_one_bytes(const char *pat, const char *target, int len,
if (*p == pat[0]) return i;
return -1;
}
- if (!useBytes && mbcslocale) { /* skip along by chars */
- mbstate_t mb_st;
+ if (!useBytes && use_UTF8) { /* not really needed */
int ib, used;
- mbs_init(&mb_st);
for (ib = 0, i = 0; ib <= len-plen; i++) {
if (strncmp(pat, target+ib, plen) == 0) return ib;
- used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
+ used = utf8clen(target[ib]);
if (used <= 0) break;
ib += used;
}
- } else if (!useBytes && use_UTF8) { /* not really needed */
+ } else if (!useBytes && mbcslocale) { /* skip along by chars */
+ mbstate_t mb_st;
int ib, used;
+ mbs_init(&mb_st);
for (ib = 0, i = 0; ib <= len-plen; i++) {
if (strncmp(pat, target+ib, plen) == 0) return ib;
- used = utf8clen(target[ib]);
+ used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
if (used <= 0) break;
ib += used;
}
-Winston
More information about the R-devel
mailing list