[R-pkg-devel] Warning... unable to translate 'Ekstr<f8>m' to a wide string; Error... input string 1 is invalid

Tue Jul 19 21:12:38 CEST 2022

On 19/07/2022 2:23 p.m., Spencer Graves wrote:
> Hi, Ivan et al.:
> 
> 
> On 7/19/22 1:03 PM, Ivan Krylov wrote:
>> On Tue, 19 Jul 2022 12:32:20 -0500
>> Spencer Graves <spencer.graves using effectivedefense.org> wrote:
>>
>>> Can someone provide me with a link to the correct development
>>> version of help('iconv')?  The current version includes the exact
>>> offending "\x" strings that I have.
>>
>> http://svn.r-project.org/R/trunk/src/library/base/man/iconv.Rd
>>
>> It still does, because it works with byte strings the right way: by
>> passing them to iconv(), which is designed to work with bytes in
>> "unknown" encodings.
>>
>> In contrast, your use of arbitrary bytes with gsub() is invalid,
>> because gsub() assumes that the strings match their declared encoding:
>> UTF-8, Latin-1, or the native locale encoding. (See ?Encoding.)
>>
>> When you write "Ekstr\xf8m", you get a string that consists of Latin-1
>> bytes but has the wrong encoding property set. Given this string,
>> gsub() and friends will break on a UTF-8 system (because "r\xf8m" is
>> not a valid UTF-8 sequence of bytes), while iconv() will not.
>>
>> Depending on the desired semantics of subNonStandardCharacters(), you
>> might be able to avoid the failures with the useBytes argument, or you
>> might silently return invalid data in some corner cases. Is the "x"
>> argument supposed to be bytes in arbitrary encoding, or properly decoded
>> characters that might include those that don't map to ASCII?
>>
> 
> 	  Wow.  So what's the recommended fix?
> 
> 
> 	  If I understand correctly, "\u**" should work with ** being f8, f6,
> df, or fc [all hex digits, I assume?].  However, "\u00**" may be
> preferred over "\u**", and "\u{**}" may be better still.

That's all correct.

> 	  The blog that Tomas wrote might be more useful if it included a
> recommendation like this.

I think that advice is too specific to this example.  Here's a short 
explanation about what's going on:

Strings in R are assumed to be in the ASCII encoding if they have no 
bytes of 128 (hex 80) or bigger.

If they do have such bytes, they can be marked as "latin1", or "UTF-8", 
or not marked, in which case they are assumed to be in the local encoding.

If you write a string containing "\u....", it is marked as being in the 
"UTF-8" encoding.

If you write a string containing "\x....", it is not marked.
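
As a rough illustration of those rules, Encoding() reports the mark on 
each string.  This is only a sketch: the variable names are mine, and the 
result shown for the "\x" string assumes a UTF-8 session, since other 
locales may mark it differently.

```r
ascii   <- "facile"        # only ASCII bytes
unicode <- "fa\u00E7ile"   # written with a \u escape
bytes   <- "fa\xE7ile"     # written with a \x escape

Encoding(ascii)    # "unknown": ASCII strings are never marked
Encoding(unicode)  # "UTF-8"
Encoding(bytes)    # "unknown" in a UTF-8 session (assumption; other locales may differ)

# The \x string is not valid UTF-8, which is why regex functions such as
# gsub() can fail on it in a UTF-8 locale.
validUTF8(c(ascii, unicode, bytes))  # TRUE TRUE FALSE
```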

Thus if you are writing strings for others to use, you don't know how 
those strings will be interpreted unless you explicitly set their 
encoding.  For example, this is ambiguous:

    x <- "fa\xE7ile"

This is not:

    x <- "fa\xE7ile"
    Encoding(x) <- "latin1"
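
Once the encoding is declared, the string can be converted and processed 
reliably.  A small sketch using only base R functions (enc2utf8(), 
utf8ToInt(), gsub()); the variable y is just for illustration:

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"      # declare what the bytes mean

y <- enc2utf8(x)             # re-encode as UTF-8
Encoding(y)                  # "UTF-8"
utf8ToInt(y)[3]              # 231, i.e. U+00E7, a c with a cedilla
identical(y, "fa\u00E7ile")  # TRUE

# With a properly marked string, gsub() has valid input to work with.
gsub("\u00E7", "c", y)       # "facile"
```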

The advice you received to change the \x to \u works for your examples, 
but might fail in other examples.  As help("Quotes") says, "\uE7" is the 
Unicode code point hex E7, which is a c with a cedilla.  (The two hex 
digit Unicode values from 80 to FF match the Latin-1 values; but not 
everyone lives and works in a Latin-1 locale, so \xE7 might not be 
equivalent to \uE7 for some people.)
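
To make that concrete, here is a sketch that decodes the same byte under 
two different encodings with iconv().  The choice of ISO-8859-5 as the 
second encoding is mine, and it assumes your iconv build knows that name 
(iconvlist() shows what is available).

```r
b <- rawToChar(as.raw(0xE7))   # the single byte E7, with no encoding mark

from_latin1   <- iconv(b, from = "latin1",     to = "UTF-8")
from_cyrillic <- iconv(b, from = "ISO-8859-5", to = "UTF-8")

utf8ToInt(from_latin1)    # 231  (U+00E7, a c with a cedilla)
utf8ToInt(from_cyrillic)  # 1095 (U+0447, a Cyrillic letter)
```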

R reads up to 4 hex digits after \u, so you can write 1-4 of them.  But if 
the character after your escape happens to be a hex digit, it is absorbed 
into the escape and you get some other character, e.g. "\uE7" is a ç (a c 
with a cedilla), but "\uE7a" is a single code point in the Thai block, and 
"\uE7ab" is some other single character (in the "private use area" of 
Unicode).

So it's safest to use exactly 4 hex digits, as in \u00E7, or to wrap the 
value in curly braces, as in \u{E7}.

Some Unicode characters need more than 4 hex digits.  Use \U for those; it 
accepts up to 8 hex digits.
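
A quick way to see how many hex digits an escape consumed is to look at 
the code points with utf8ToInt().  This is a sketch, and the \U line 
assumes a reasonably recent R on your platform:

```r
utf8ToInt("\uE7a")       # 3706: one code point, U+0E7A (the "a" was absorbed)
utf8ToInt("\u00E7a")     # 231 97: c with a cedilla, then "a"
utf8ToInt("\u{E7}a")     # 231 97: the same, using braces
utf8ToInt("\uE7ab")      # 59307: one code point, U+E7AB (private use area)
utf8ToInt("\U0001F600")  # 128512: U+1F600, which needs \U
```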

Duncan Murdoch