[Rd] Invalid UTF-8 with gsub(perl=TRUE) and iconv(sub="")

Milan Bouchet-Valat nalimilan at club.fr
Mon Sep 9 18:46:20 CEST 2013


On Monday, 9 September 2013 at 13:59 +0100, Prof Brian Ripley wrote:
> On 09/09/2013 09:49, Milan Bouchet-Valat wrote:
> > Hi!
> >
> > I experience an error with an invalid UTF-8 character passed to
> > gsub(..., perl=TRUE); the interesting point is that with perl=FALSE (the
> > default) no error happens. (The character itself was read from an
> > invalid HTML file.) Illustration of the error:
> >
> > gsub("a", "", "\U3e3965", perl=FALSE)
> > # [1] "\U3e3965"
> > gsub("a", "", "\U3e3965", perl=TRUE)
> > # Error in gsub("a", "", "\U3e3965", perl = TRUE) :
> > #   input string 1 is invalid UTF-8
> >
> >
> > The error message in the second command seems to come from
> > src/main/grep.c:1640 (in do_gsub):
> > if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);
> >
> > utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
> > described in src/extra/pcre/pcre_valid_utf8.c.
> >
> >
> >
> > Even more problematic/interesting is the fact that iconv() does not
> > consider the above character as invalid, as it does not replace it when
> > using the sub argument.
> > iconv("a\U3e3965", sub="")
> > # [1] "a\U003e3965"
> >
> > On the contrary, an invalid sequence such as \xff is substituted:
> > iconv("a\xff", sub="")
> > # [1] "a"
> >
> > This makes it difficult to sanitize the string before passing it to
> > gsub(perl=TRUE). Thus, I'm wondering whether something could be done,
> > and where. Should iconv() and PCRE be made to agree on the definition of
> > an invalid UTF-8 sequence?
> 
> iconv() is using a system service: read its help page.  So you know 
> where to report this ....
Yes, but why is "\U003e3965" considered valid by gsub(perl=TRUE) and
printed as a character on Windows 7, but not on Linux? Do you think this
is a separate bug on Windows?
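
In the meantime, a rough sketch of a possible workaround, which does not rely on iconv() to detect the bad input. It assumes an R version that provides validUTF8() (R >= 3.3.0), which as far as I understand exposes the same validity check used by grep() and friends. The helper name sanitize_utf8 is hypothetical, and dropping every non-ASCII byte from an invalid string is a deliberately crude choice, not a proper repair:

```r
# Hypothetical helper (not part of R): crude fallback sanitizer.
# Strings that fail the UTF-8 validity check lose all their non-ASCII bytes;
# valid strings and NA values are returned unchanged.
sanitize_utf8 <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)  # validUTF8() is assumed to be available
  x[bad] <- vapply(x[bad], function(s) {
    bytes <- charToRaw(s)
    rawToChar(bytes[bytes < as.raw(0x80)])  # keep only bytes below 0x80 (ASCII)
  }, character(1), USE.NAMES = FALSE)
  x
}

x <- c("abc", "a\xff")        # second element is not valid UTF-8
clean <- sanitize_utf8(x)
result <- gsub("a", "", clean, perl = TRUE)
print(result)
```

The point is only that the check and the cleanup are both done at the byte level, so the result no longer depends on what the system iconv() considers invalid.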


Thanks for your help


