[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Tomas Kalibera tomas.kalibera at gmail.com
Tue Jan 31 12:51:56 CET 2023


On 1/31/23 11:50, Martin Maechler wrote:
>>>>>> Tomas Kalibera
>>>>>>      on Tue, 31 Jan 2023 10:53:21 +0100 writes:
>      > On 1/31/23 09:48, Ivan Krylov wrote:
>      >> Can we use the "bytes" encoding for such environment variables invalid
>      >> in the current locale? The following patch preserves CE_NATIVE for
>      >> strings valid in the current UTF-8 or multibyte locale (or
>      >> non-multibyte strings) but sets CE_BYTES for those that are invalid:
>      >>
>      >> Index: src/main/sysutils.c
>      >> ===================================================================
>      >> --- src/main/sysutils.c	(revision 83731)
> .....
>      >>
>      >> Here are the potential problems with this approach:
>      >>
>      >> * I don't know whether known_to_be_utf8 can disagree with utf8locale.
>      >> known_to_be_utf8 was the original condition for setting CE_UTF8 on
>      >> the string. I also need to detect non-UTF-8 multibyte locales, so
>      >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more
>      >> careful and test for (enc == CE_UTF8) || (utf8locale && enc ==
>      >> CE_NATIVE) instead of just utf8locale.
>      >>
>      >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
>      >> strings in the environment with this patch applied, but now
>      >> print.Dlist does, because formatDL wants to compute the width of the
>      >> string which has the 'bytes' encoding. If this is a good way to
>      >> solve the problem, I can work on suggesting a fix for formatDL to
>      >> avoid the error.
>
>      > Thanks, indeed, type instability is a big problem of the
>      > "turn invalid strings into bytes" approach. It is something that
>      > has historically been done in regular-expression operations, but
>      > it is brittle and not user friendly: writing code that is
>      > agnostic to whether it deals with "bytes" or a regular string is
>      > very tedious. Part of the type instability also comes from the
>      > fact that ASCII strings are always flagged "native" (never
>      > "bytes"). Last year I had to revert a big change which broke
>      > existing code by introducing more of this instability while
>      > dealing better with invalid strings in regular expressions. I've
>      > made some additions to R-devel that help deal with such
>      > instability, but it is still a pain and existing code has to be
>      > changed (and made more complicated).
>
>      > So, I don't think this is the way to go.
>
>      > Tomas
>
> hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
> but of course you are right that type stability is really a
> valid goal....
>
> In general, what about behaving close to "old R" and replacing all such
> strings by  NA_character_  (typically raising one warning)?
> This would keep the result a valid character vector, just with some NA entries.
>
> Specifically for  Sys.getenv(),  I still think Simon has a very
> valid point of "requiring" (of our design) that
> Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
> Sys.getenv("BOOM")
>
> Also, as typical R user, I'd definitely want to be able to get all the valid
> environment variables, even if there are one or more invalid
> ones. ... and similarly in other cases, it may be a cheap
> strategy to replace invalid strings ("string" in the sense of
> length 1 STRSXP, i.e., in R, a "character" of length 1) by
> NA_character_  and keep all valid parts of the character vector
> in a valid encoding.
In the specific case of Sys.getenv(), yes, we could return NA for 
variables containing invalid strings, both when obtaining the value of a 
single variable and when obtaining all of them. That would partially 
match the undocumented and unintentional behavior R had before 4.1, and 
it would make Sys.getenv(var) and Sys.getenv()[[var]] consistent even 
with invalid strings. Once we decide how to deal with invalid strings in 
general, we can change this again accordingly, breaking code for people 
who depend on these things (but so far I only know about this one case). 
Perhaps this would be a logical alternative to the Python approach, one 
more suitable for R (given that we have NAs and that we happened to have 
somewhat similar behavior before). Conceptually it amounts to the same 
thing as omitting the variable, as Python does: R users would not be 
able to use such variables, but as long as they don't touch them, the 
variables could still be inherited by child processes, etc.

> - - - - - -
>
> In addition to the above cheap "replace by NA"-strategy,
> and at least once R is "all UTF-8", we could
> also consider a more expensive strategy that would try to
> replace invalid characters/byte-sequences by one specific valid UTF-8
> character, i.e., glyph (think of a special version of "?") that
> we would designate as replacement for "invalid-in-current-encoding".
>
> Probably such a glyph already exists and we have seen it used in
> some console's output when having to print such things.

This is part of the big discussion where we couldn't find a consensus, 
and I think that happened so recently that it is too early to open it 
again. Also, we are not yet at the point where we could assume UTF-8 is 
always the native encoding. As you know, my currently preferred solution 
is to allow invalid portions in UTF-8 strings and deal with them in 
individual functions: often one can process invalid UTF-8 bytes as usual 
(e.g. copying, concatenation, search); in some functions one has to use 
a special algorithm (e.g. regular expressions; PCRE2 supports that); in 
some cases one would substitute characters; and in some cases one would 
throw an error. As strings are immutable in R, validity checks could be 
done once per string. This would in many cases allow UTF-8 strings with 
bits of errors to "go through" R without triggering errors. We didn't 
have such options with other multi-byte encodings, but UTF-8 is designed 
so that this is possible.

Substituting various escapes ("&lt;U+..&gt;", "??", etc.) is something we 
already do in R in some cases, but it cannot be the default mechanism 
for dealing with invalid strings. File names are a prime example of 
where it doesn't work, and indeed there are libraries for treating file 
names as special entities.

As you probably know, part of the problem is also that the invalid 
strings "tolerated" by operating systems (file names, environment 
variables) can "originally" be in different "encodings". Sorry for all 
the quotes, but, simplifying a lot: Unix file systems often treat file 
names as arbitrary byte sequences (even though applications often 
assume UTF-8), while Windows uses sequences of 16-bit words (sometimes 
understood as UCS-2, but not necessarily valid UTF-16). With 
environment variables this goes as far as having a narrow and a wide 
version which may not directly correspond (I have not investigated the 
details yet; see the documentation for _wenviron etc.). In R, 
particularly on Windows, we can never completely get rid of UTF-16 and 
the wide-character interface, so in some cases we would want to 
"convert invalid strings" between encodings, which of course doesn't 
make sense and would most likely be the case where invalid strings 
cause errors. These things are so complicated that, while it is not my 
currently preferred solution, I understand the position that we should 
ban invalid strings entirely and keep them banned even when UTF-8 is 
the only supported native encoding.

Tomas


