[R-pkg-devel] Warning... unable to translate 'Ekstr<f8>m' to a wide string; Error... input string 1 is invalid

Wed Jul 20 02:56:45 CEST 2022

On 7/19/22 2:12 PM, Duncan Murdoch wrote:
> On 19/07/2022 2:23 p.m., Spencer Graves wrote:
>> Hi, Ivan et al.:
>>
>>
>> On 7/19/22 1:03 PM, Ivan Krylov wrote:
>>> On Tue, 19 Jul 2022 12:32:20 -0500
>>> Spencer Graves <spencer.graves using effectivedefense.org> wrote:
>>>
>>>> Can someone provide me with a link to the correct development
>>>> version of help('iconv')?  The current version includes the exact
>>>> offending "\x" strings that I have.
>>>
>>> http://svn.r-project.org/R/trunk/src/library/base/man/iconv.Rd
>>>
>>> It still does, because it works with byte strings the right way: by
>>> passing them to iconv(), which is designed to work with bytes in
>>> "unknown" encodings.
>>>
>>> In contrast, your use of arbitrary bytes with gsub() is invalid,
>>> because gsub() assumes that the strings match their declared encoding:
>>> UTF-8, Latin-1, or the native locale encoding. (See ?Encoding.)
>>>
>>> When you write "Ekstr\xf8m", you get a string that consists of Latin-1
>>> bytes but has the wrong encoding property set. Given this string,
>>> gsub() and friends will break on a UTF-8 system (because "r\xf8m" is
>>> not a valid UTF-8 sequence of bytes), while iconv() will not.
>>>
>>> Depending on the desired semantics of subNonStandardCharacters(), you
>>> might be able to avoid the failures with the useBytes argument, or you
>>> might silently return invalid data in some corner cases. Is the "x"
>>> argument supposed to be bytes in arbitrary encoding, or properly decoded
>>> characters that might include those that don't map to ASCII?
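
A minimal sketch of the distinction described above, assuming the bytes really are Latin-1 (the names x, y, and z are only illustrative). It shows that the raw string is not valid UTF-8, that iconv() can decode it once told the source encoding, and that a byte-level gsub() avoids the translation step entirely:

```r
x <- "Ekstr\xf8m"          # Latin-1 bytes, but no declared encoding
Encoding(x)                # "unknown"
validUTF8(x)               # FALSE: 0xf8 cannot occur in valid UTF-8

# Decode by stating the source encoding explicitly (assumed Latin-1 here).
y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)                # "UTF-8"

# Byte-level replacement: no encoding is assumed, the byte is matched as-is.
z <- gsub("\xf8", "o", x, fixed = TRUE, useBytes = TRUE)
```

Whether the iconv() route or the useBytes route is appropriate depends on the answer to the question above: decoded characters in the first case, opaque bytes in the second.
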
>>>
>>
>>       Wow.  So what's the recommended fix?
>>
>>
>>       If I understand correctly, "\u**" should work with ** being f8, f6,
>> df, or fc [all hex digits, I assume?].  However, "\u00**" may be
>> preferred over "\u**", and "\u{**}" may be better still.
> 
> That's all correct.
> 
>>       The blog that Tomas wrote might be more useful if it included a
>> recommendation like this.
> 
> I think that advice is too specific to this example.  Here's a short 
> explanation about what's going on:
> 
> Strings in R are assumed to be in the ASCII encoding if they have no 
> bytes of 128 (hex 80) or above.
> 
> If they do have such bytes, they can be marked as "latin1", or "UTF-8", 
> or not marked, in which case they are assumed to be in the local encoding.
> 
> If you write a string containing "\u....", it is marked as being in the 
> "UTF-8" encoding.
> 
> If you write a string containing "\x..", it is not marked.
> 
> Thus if you are writing strings for others to use, you don't know how 
> those strings will be interpreted unless you explicitly set their 
> encoding.  For example, this is ambiguous:
> 
>     x <- "fa\xE7ile"
> 
> This is not:
> 
>     x <- "fa\xE7ile"
>     Encoding(x) <- "latin1"
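
A short sketch of what the marking changes, assuming the byte 0xE7 is meant as Latin-1 (the variable names before and u are only illustrative). Once the string is marked, enc2utf8() can convert it reliably:

```r
x <- "fa\xE7ile"
before <- Encoding(x)      # "unknown": interpretation depends on the locale

Encoding(x) <- "latin1"    # declare what the bytes mean; the bytes are unchanged
u <- enc2utf8(x)           # convert to UTF-8 using the declared encoding

Encoding(u)                # "UTF-8"
charToRaw(u)               # 66 61 c3 a7 69 6c 65
```

Note that Encoding<- only changes the declaration; the single byte e7 becomes the two bytes c3 a7 only in the enc2utf8() step.
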
> 
> The advice you received to change the \x to \u works for your examples, 
> but might fail in other examples.  As help("Quotes") says, "\uE7" is the 
> Unicode code point hex E7, which is a c with a cedilla.  (The two hex 
> digit Unicode values from 80 to FF match the Latin-1 values; but not 
> everyone lives and works in a Latin-1 locale, so \xE7 might not be 
> equivalent to \uE7 for some people.)
> 
> You can have 1-4 hex digits after \u.  If the next character happens to 
> be a hex digit, you'll get some other character, e.g. "\uE7" is a ç (a c 
> with a cedilla), but "\uE7a" is the single code point U+0E7A (in the 
> Thai block), and "\uE7ab" is some other single code point, U+E7AB (in 
> the "private use area" of Unicode).
> 
> So it's safest to use exactly 4 hex digits as \u00E7, or to wrap the 
> value in curly braces, \u{E7}.
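
The greedy parsing can be checked with utf8ToInt(), which returns the code points a string contains. This is only a sketch, and the variable names are illustrative:

```r
short  <- utf8ToInt("\uE7")      # 231 (hex E7), the c with a cedilla
greedy <- utf8ToInt("\uE7a")     # 3706 (hex 0E7A): the "a" was taken as a hex digit
braced <- utf8ToInt("\u{E7}a")   # 231 97: c with a cedilla, then a plain "a"
padded <- utf8ToInt("\u00E7a")   # 231 97: four digits also end the escape
```

Both the braced and the four-digit forms leave the following "a" alone, which is why either is safe when the next character could be a hex digit.
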

Thanks to all who replied.  I've changed all the "\x.." to "\u{..}".  I 
have other issues to deal with, but this seems to answer this question.
Thanks again, Spencer Graves

> 
> Some Unicode characters need more than 4 hex digits.  Use \U for those.
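
As an illustration of the \U form, here is a sketch using U+1F600, a code point chosen only as an example of one that needs five hex digits:

```r
emoji <- "\U{1F600}"             # equivalently "\U0001F600"
utf8ToInt(emoji)                 # 128512 (hex 1F600)
charToRaw(emoji)                 # f0 9f 98 80: four bytes in UTF-8
```
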
> 
> Duncan Murdoch