[R-pkg-devel] Warning... unable to translate 'Ekstr<f8>m' to a wide string; Error... input string 1 is invalid
Spencer Graves
spencer.graves using effectivedefense.org
Tue Jul 19 20:23:11 CEST 2022
Hi, Ivan et al.:
On 7/19/22 1:03 PM, Ivan Krylov wrote:
> On Tue, 19 Jul 2022 12:32:20 -0500
> Spencer Graves <spencer.graves using effectivedefense.org> wrote:
>
>> Can someone provide me with a link to the correct development
>> version of help('iconv')? The current version includes the exact
>> offending "\x" strings that I have.
>
> http://svn.r-project.org/R/trunk/src/library/base/man/iconv.Rd
>
> It still does, because it works with byte strings the right way: by
> passing them to iconv(), which is designed to work with bytes in
> "unknown" encodings.
>
> In contrast, your use of arbitrary bytes with gsub() is invalid,
> because gsub() assumes that the strings match their declared encoding:
> UTF-8, Latin-1, or the native locale encoding. (See ?Encoding.)
>
> When you write "Ekstr\xf8m", you get a string that consists of Latin-1
> bytes but has the wrong encoding property set. Given this string,
> gsub() and friends will break on a UTF-8 system (because "r\xf8m" is
> not a valid UTF-8 sequence of bytes), while iconv() will not.
>
> Depending on the desired semantics of subNonStandardCharacters(), you
> might be able to avoid the failures with the useBytes argument, or you
> might silently return invalid data in some corner cases. Is the "x"
> argument supposed to be bytes in arbitrary encoding, or properly decoded
> characters that might include those that don't map to ASCII?
>
Wow. So what's the recommended fix?
If I understand correctly, "\u**" should work with ** being f8, f6,
df, or fc [all hex digits, I assume?]. However, "\u00**" may be
preferred over "\u**", and "\u{**}" may be better still.
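For concreteness, here is a minimal sketch of what I think this amounts to, assuming a UTF-8 session. The variable names are hypothetical and mine; only "Ekstr\xf8m" comes from this thread. It compares the raw-byte spelling with the Unicode escape, then converts the byte string with iconv() as Ivan describes.

```r
# Hypothetical variable names; only the string "Ekstr\xf8m" is from this thread.
bad  <- "Ekstr\xf8m"     # raw Latin-1 byte 0xf8, encoding left undeclared
good <- "Ekstr\u00f8m"   # Unicode escape for the same character (U+00F8)

validUTF8(bad)           # FALSE: a lone 0xf8 byte is not valid UTF-8
validUTF8(good)          # TRUE

# Declare what the bytes really are and convert them to UTF-8.
fixed <- iconv(bad, from = "latin1", to = "UTF-8")
same  <- identical(fixed, good)

# With a properly encoded string, gsub() needs no special handling.
ascii <- gsub("\u00f8", "o", good)
```

If that is right, writing the four-digit form "\u00f8" also avoids any ambiguity when the next character happens to be a hex digit.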
The blog that Tomas wrote might be more useful if it included a
recommendation like this.
Thanks for all your work to make R better and thereby help people
everywhere extract better information from the data available to them.
Spencer