[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Tue Jan 31 12:23:08 CET 2023

On 31/01/2023 5:50 a.m., Martin Maechler wrote:
>>>>>> Tomas Kalibera
>>>>>>      on Tue, 31 Jan 2023 10:53:21 +0100 writes:
> 
>      > On 1/31/23 09:48, Ivan Krylov wrote:
>      >> Can we use the "bytes" encoding for such environment variables invalid
>      >> in the current locale? The following patch preserves CE_NATIVE for
>      >> strings valid in the current UTF-8 or multibyte locale (or
>      >> non-multibyte strings) but sets CE_BYTES for those that are invalid:
>      >>
>      >> Index: src/main/sysutils.c
>      >> ===================================================================
>      >> --- src/main/sysutils.c	(revision 83731)
> .....
>      >>
>      >> Here are the potential problems with this approach:
>      >>
>      >> * I don't know whether known_to_be_utf8 can disagree with utf8locale.
>      >> known_to_be_utf8 was the original condition for setting CE_UTF8 on
>      >> the string. I also need to detect non-UTF-8 multibyte locales, so
>      >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more
>      >> careful and test for (enc == CE_UTF8) || (utf8locale && enc ==
>      >> CE_NATIVE) instead of just utf8locale.
>      >>
>      >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
>      >> strings in the environment with this patch applied, but now
>      >> print.Dlist does, because formatDL wants to compute the width of the
>      >> string which has the 'bytes' encoding. If this is a good way to
>      >> solve the problem, I can work on suggesting a fix for formatDL to
>      >> avoid the error.
> 
>      > Thanks, indeed, type instability is a big problem of the approach "turn
>      > invalid strings to bytes". It is something what is historically being
>      > done in regular expression operations, but it is brittle and not user
>      > friendly: writing code to be agnostic to whether we are dealing with
>      > "bytes" or a regular string is very tedious. Pieces of type instability
>      > come also from that ASCII strings are always flagged "native" (never
>      > "bytes"). Last year I had to revert a big change which broke existing
>      > code by introducing some more of this instability due to better dealing
>      > with invalid strings in regular expressions. I've made some additions to
>      > R-devel allowing to better deal with such instability but it is still a
>      > pain and existing code has to be changed (and made more complicated).
> 
>      > So, I don't think this is the way to go.
> 
>      > Tomas
> 
> hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
> but of course you are right that type stability is really a
> valid goal....
> 
> In general, what about behaving close to "old R" and replace all such
> strings by  NA_character_  (and typically raising one warning)?
> This would keep the result a valid character vector, just with some NA entries.
> 
> Specifically for  Sys.getenv(),  I still think Simon has a very
> valid point of "requiring" (of our design) that
> Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
> Sys.getenv("BOOM")
> 
> Also, as typical R user, I'd definitely want to be able to get all the valid
> environment variables, even if there are one or more invalid
> ones. ... and similarly in other cases, it may be a cheap
> strategy to replace invalid strings ("string" in the sense of
> length 1 STRSXP, i.e., in R, a "character" of length 1) by
> NA_character_  and keep all valid parts of the character vector
> in a valid encoding.
> 
> - - - - - -
> 
> In addition to the above cheap "replace by NA"-strategy,
> and at least once R is "all UTF-8", we could
> also consider a more expensive strategy that would try to
> replace invalid characters/byte-sequences by one specific valid UTF-8
> character, i.e., glyph (think of a special version of "?") that
> we would designate as replacement for "invalid-in-current-encoding".
> 
> Probably such a glyph already exists and we have seen it used in
> some console's output when having to print such things.

I think it is U+FFFD, described here:

https://www.fileformat.info/info/unicode/char/fffd/index.htm

Duncan Murdoch