[R-pkg-devel] Warning... unable to translate 'Ekstr<f8>m' to a wide string; Error... input string 1 is invalid
Spencer Graves
spencer.graves at effectivedefense.org
Wed Jul 20 02:56:45 CEST 2022
On 7/19/22 2:12 PM, Duncan Murdoch wrote:
> On 19/07/2022 2:23 p.m., Spencer Graves wrote:
>> Hi, Ivan et al.:
>>
>>
>> On 7/19/22 1:03 PM, Ivan Krylov wrote:
>>> On Tue, 19 Jul 2022 12:32:20 -0500
>>> Spencer Graves <spencer.graves using effectivedefense.org> wrote:
>>>
>>>> Can someone provide me with a link to the correct development
>>>> version of help('iconv')? The current version includes the exact
>>>> offending "\x" strings that I have.
>>>
>>> http://svn.r-project.org/R/trunk/src/library/base/man/iconv.Rd
>>>
>>> It still does, because it works with byte strings the right way: by
>>> passing them to iconv(), which is designed to work with bytes in
>>> "unknown" encodings.
>>>
>>> In contrast, your use of arbitrary bytes with gsub() is invalid,
>>> because gsub() assumes that the strings match their declared encoding:
>>> UTF-8, Latin-1, or the native locale encoding. (See ?Encoding.)
>>>
>>> When you write "Ekstr\xf8m", you get a string that consists of Latin-1
>>> bytes but has the wrong encoding property set. Given this string,
>>> gsub() and friends will break on a UTF-8 system (because "r\xf8m" is
>>> not a valid UTF-8 sequence of bytes), while iconv() will not.
>>>
>>> Depending on the desired semantics of subNonStandardCharacters(), you
>>> might be able to avoid the failures with the useBytes argument, or you
>>> might silently return invalid data in some corner cases. Is the "x"
>>> argument supposed to be bytes in arbitrary encoding, or properly decoded
>>> characters that might include those that don't map to ASCII?
>>>
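A minimal sketch of the point above, assuming a session where the string is written with a "\x" escape as in the original report. It only inspects the string and converts it with iconv(); the variable names x and y are illustrative.

```r
# A Latin-1 byte (F8) written with \x: the string carries no encoding mark.
x <- "Ekstr\xf8m"

Encoding(x)    # no declared encoding, reported as "unknown"
validUTF8(x)   # the bytes do not form valid UTF-8

# iconv() is told explicitly what the bytes are, so it can convert them.
y <- iconv(x, from = "latin1", to = "UTF-8")

Encoding(y)                     # now marked as UTF-8
identical(y, "Ekstr\u00f8m")    # same string as the \u form
```

Once the string is valid and marked, functions such as gsub() no longer have to guess what the bytes mean.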
>>
>> Wow. So what's the recommended fix?
>>
>>
>> If I understand correctly, "\u**" should work with ** being f8, f6,
>> df, or fc [all hex digits, I assume?]. However, "\u00**" may be
>> preferred over "\u**", and "\u{**}" may be better still.
>
> That's all correct.
>
>> The blog that Tomas wrote might be more useful if it included a
>> recommendation like this.
>
> I think that advice is too specific to this example. Here's a short
> explanation about what's going on:
>
> Strings in R are assumed to be in the ASCII encoding if they have no
> bytes of value 128 (hex 80) or greater.
>
> If they do have such bytes, they can be marked as "latin1", or "UTF-8",
> or not marked, in which case they are assumed to be in the local encoding.
>
> If you write a string containing "\u....", it is marked as being in the
> "UTF-8" encoding.
>
> If you write a string containing "\x..", it is not marked.
>
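The marking rules above can be checked with Encoding(). This is a small sketch; the three variable names are illustrative only. Note that Encoding() also reports "unknown" for pure ASCII strings, since they need no mark.

```r
plain   <- "facile"        # ASCII only
marked  <- "fa\u00E7ile"   # \u escape: marked as UTF-8
raw_hex <- "fa\xE7ile"     # \x escape: a non-ASCII byte, but no mark

enc <- Encoding(c(plain, marked, raw_hex))
enc
```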
> Thus if you are writing strings for others to use, you don't know how
> those strings will be interpreted unless you explicitly set their
> encoding. For example, this is ambiguous:
>
> x <- "fa\xE7ile"
>
> This is not:
>
> x <- "fa\xE7ile"
> Encoding(x) <- "latin1"
>
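As a sketch of why the explicit mark removes the ambiguity: once the string is declared as Latin-1, it can be converted to UTF-8 and compared with the "\u" form. The name x_utf8 is illustrative.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"      # declare what the byte E7 means

x_utf8 <- enc2utf8(x)        # convert using the declared encoding

Encoding(x_utf8)                       # marked as UTF-8
identical(x_utf8, "fa\u00E7ile")       # same string as the \u form
utf8ToInt(x_utf8)[3] == 0xE7           # third character is U+00E7
```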
> The advice you received to change the \x to \u works for your examples,
> but might fail in other examples. As help("Quotes") says, "\uE7" is the
> Unicode code point hex E7, which is a c with a cedilla. (The two hex
> digit Unicode values from 80 to FF match the Latin-1 values; but not
> everyone lives and works in a Latin-1 locale, so \xE7 might not be
> equivalent to \uE7 for some people.)
>
> You can have 1-4 hex digits after \u. If you write fewer than four and
> the next character happens to be a hex digit, it is absorbed into the
> escape and you'll get some other character, e.g. "\uE7" is a ç (a c
> with a cedilla), but "\uE7a" is a single code point in the Thai block
> (U+0E7A), and "\uE7ab" is some other single character (U+E7AB, in the
> "private use area" of Unicode).
>
> So it's safest to use exactly 4 hex digits, as in \u00E7, or to wrap the
> value in curly braces, as in \u{E7}.

Thanks to all who replied. I've changed all the "\x.." to "\u{..}". I
have other issues to deal with, but this seems to answer this question.

Thanks again,
Spencer Graves
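
The digit-count rule quoted above can be seen by looking at the code points each literal produces. This is a sketch only; the variable names are illustrative.

```r
short   <- "\uE7"       # two digits, nothing follows: U+00E7
greedy1 <- "\uE7a"      # the "a" is absorbed into the escape: U+0E7A
greedy2 <- "\uE7ab"     # "ab" is absorbed: U+E7AB
padded  <- "\u00E7ab"   # four digits, then a literal "ab"
braced  <- "\u{E7}ab"   # braces delimit the code point

sprintf("U+%04X", utf8ToInt(short))
sprintf("U+%04X", utf8ToInt(greedy1))
sprintf("U+%04X", utf8ToInt(greedy2))

identical(padded, braced)     # both are c-cedilla followed by "ab"
length(utf8ToInt(braced))     # three code points
```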
>
> Some Unicode characters need more than 4 hex digits. Use \U for those.
>
> Duncan Murdoch
>
>
>
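For the last point in the quoted text, a brief sketch of "\U" with a code point above U+FFFF. The choice of U+1F600 is just an example, and the names s and s_braced are illustrative.

```r
s        <- "\U0001F600"   # up to eight hex digits after \U
s_braced <- "\U{1F600}"    # braces work here as well

Encoding(s)                # marked as UTF-8, like \u strings
utf8ToInt(s) == 0x1F600    # a single code point beyond four hex digits
identical(s, s_braced)
```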