[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")
Milan Bouchet-Valat
nalimilan at club.fr
Mon Sep 9 10:49:21 CEST 2013
Hi!
I get an error when a string containing an invalid UTF-8 character is
passed to gsub(..., perl=TRUE); interestingly, with perl=FALSE (the
default) no error occurs. (The character itself was read from an
invalid HTML file.) Here is an illustration of the error:
gsub("a", "", "\U3e3965", perl=FALSE)
# [1] "\U3e3965"
gsub("a", "", "\U3e3965", perl=TRUE)
# Error in gsub("a", "", "\U3e3965", perl = TRUE) :
# input string 1 is invalid UTF-8
The error message in the second command seems to come from
src/main/grep.c:1640 (in do_gsub):
if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);
utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
described in src/extra/pcre/pcre_valid_utf8.c.
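For reference, the same kind of check can be tried from R itself. The
sketch below assumes an R version that provides validUTF8() (it was
added to base R after this message was written). The test strings are
built from raw bytes so that nothing depends on how the parser handles
the escapes; the five bytes after "a" are what U+3E3965 would look like
in the obsolete 5-byte UTF-8 form, which is an assumption about how the
string above is stored.

```r
# Sketch only: assumes validUTF8() is available (base R >= 3.3.0).
ok       <- "abc"
# "a" followed by a stray 0xff byte
bad_byte <- rawToChar(as.raw(c(0x61, 0xff)))
# "a" followed by U+3E3965 in the obsolete 5-byte form (assumed storage)
bad_long <- rawToChar(as.raw(c(0x61, 0xf8, 0x8f, 0xa3, 0xa5, 0xa5)))

# One logical value per input string: TRUE means "valid UTF-8"
valid <- validUTF8(c(ok, bad_byte, bad_long))
print(valid)
```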
Even more problematic/interesting is that iconv() does not treat the
above character as invalid: it does not replace it when the sub
argument is used.
iconv("a\U3e3965", sub="")
# [1] "a\U003e3965"
By contrast, an invalid sequence such as \xff is substituted:
iconv("a\xff", sub="")
# [1] "a"
This makes it difficult to sanitize the string before passing it to
gsub(perl=TRUE). Thus, I'm wondering whether something could be done,
and where. Should iconv() and PCRE be made to agree on the definition of
an invalid UTF-8 sequence?
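Until the two agree, one possible workaround is to validate the strings
before they reach gsub(perl=TRUE). The following is only a sketch:
sanitize_utf8() is a hypothetical helper, not part of R; it assumes
validUTF8() is available (a later addition to base R) and that the input
is meant to be UTF-8. It first lets iconv() drop what it can, then sets
anything that still fails the check to NA, since what iconv() removes
may vary by platform.

```r
# Hypothetical helper (not part of R): make strings safe for gsub(perl = TRUE).
# Assumes validUTF8() is available and that the input is meant to be UTF-8.
sanitize_utf8 <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  # First try: let iconv() drop the bytes it cannot convert.
  x[bad] <- iconv(x[bad], from = "UTF-8", to = "UTF-8", sub = "")
  # Fallback: iconv() may keep sequences that the stricter check rejects;
  # anything still invalid becomes NA instead of raising an error later.
  x[!is.na(x) & !validUTF8(x)] <- NA_character_
  x
}

input <- c("banana",
           rawToChar(as.raw(c(0x61, 0xff))),
           rawToChar(as.raw(c(0x61, 0xf8, 0x8f, 0xa3, 0xa5, 0xa5))))
clean  <- sanitize_utf8(input)
result <- gsub("a", "", clean, perl = TRUE)  # no "invalid UTF-8" error expected
print(result)
```

Turning the remaining invalid strings into NA is a deliberate choice
here: it loses data, but it keeps the later gsub() call from stopping on
a single bad element.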
Regards