[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
Tomas Kalibera
tomas.kalibera at gmail.com
Tue Jan 31 12:51:56 CET 2023
On 1/31/23 11:50, Martin Maechler wrote:
>>>>>> Tomas Kalibera
>>>>>> on Tue, 31 Jan 2023 10:53:21 +0100 writes:
> > On 1/31/23 09:48, Ivan Krylov wrote:
> >> Can we use the "bytes" encoding for such environment variables invalid
> >> in the current locale? The following patch preserves CE_NATIVE for
> >> strings valid in the current UTF-8 or multibyte locale (or
> >> non-multibyte strings) but sets CE_BYTES for those that are invalid:
> >>
> >> Index: src/main/sysutils.c
> >> ===================================================================
> >> --- src/main/sysutils.c (revision 83731)
> .....
> >>
> >> Here are the potential problems with this approach:
> >>
> >> * I don't know whether known_to_be_utf8 can disagree with utf8locale.
> >> known_to_be_utf8 was the original condition for setting CE_UTF8 on
> >> the string. I also need to detect non-UTF-8 multibyte locales, so
> >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more
> >> careful and test for (enc == CE_UTF8) || (utf8locale && enc ==
> >> CE_NATIVE) instead of just utf8locale.
> >>
> >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
> >> strings in the environment with this patch applied, but now
> >> print.Dlist does, because formatDL wants to compute the width of the
> >> string which has the 'bytes' encoding. If this is a good way to
> >> solve the problem, I can work on suggesting a fix for formatDL to
> >> avoid the error.
>
> > Thanks, indeed, type instability is a big problem with the approach of
> > turning invalid strings into "bytes". It is something that has
> > historically been done in regular expression operations, but it is
> > brittle and not user friendly: writing code that is agnostic to
> > whether it is dealing with "bytes" or a regular string is very
> > tedious. Some of the type instability also comes from the fact that
> > ASCII strings are always flagged "native" (never "bytes"). Last year I
> > had to revert a big change which broke existing code by introducing
> > more of this instability while trying to deal better with invalid
> > strings in regular expressions. I've made some additions to R-devel
> > that make it easier to cope with such instability, but it is still a
> > pain and existing code has to be changed (and made more complicated).
>
> > So, I don't think this is the way to go.
>
> > Tomas
>
> hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
> but of course you are right that type stability is really a
> valid goal....
>
> In general, what about behaving close to "old R" and replace all such
> strings by NA_character_ (and typically raising one warning)?
> This would keep the result a valid character vector, just with some NA entries.
>
> Specifically for Sys.getenv(), I still think Simon has a very
> valid point of "requiring" (of our design) that
> Sys.getenv()[["BOOM"]] {double `[[`} should be the same as
> Sys.getenv("BOOM")
>
> Also, as typical R user, I'd definitely want to be able to get all the valid
> environment variables, even if there are one or more invalid
> ones. ... and similarly in other cases, it may be a cheap
> strategy to replace invalid strings ("string" in the sense of
> length 1 STRSXP, i.e., in R, a "character" of length 1) by
> NA_character_ and keep all valid parts of the character vector
> in a valid encoding.
In the case of Sys.getenv() specifically, yes, we could return NA for
variables containing invalid strings, both when obtaining the value of a
single variable and when obtaining all of them. That would partially
match the undocumented and unintentional behavior R had before 4.1, and
it would make Sys.getenv(var) and Sys.getenv()[[var]] consistent even
with invalid strings. Once we decide how to deal with invalid strings in
general, we can change this again accordingly, breaking code for people
who depend on these details (but so far I only know about this one
case). Perhaps this would be a logical alternative to the Python
approach, one more suitable for R (given that we have NAs and that we
happened to have this somewhat similar behavior before). Conceptually it
amounts to the same thing as omitting the variable in Python: R users
would not be able to use such variables, but as long as they don't touch
them, the variables could still be inherited by child processes, etc.
> - - - - - -
>
> In addition to the above cheap "replace by NA"-strategy,
> and at least once R is "all UTF-8", we could
> also consider a more expensive strategy that would try to
> replace invalid characters/byte-sequences by one specific valid UTF-8
> character, i.e., glyph (think of a special version of "?") that
> we would designate as replacement for "invalid-in-current-encoding".
>
> Probably such a glyph already exists and we have seen it used in
> some console's output when having to print such things.
This is part of the big discussion where we couldn't find a consensus,
and I think that happened so recently that it is too early to reopen it.
Also, we are not yet at the point where we could assume UTF-8 is always
the native encoding. As you know, my currently preferred solution is to
allow invalid portions of UTF-8 and handle them in individual functions:
often one can proceed with invalid UTF-8 bytes as usual (e.g. copying,
concatenation, search); in some functions one has to use a special
algorithm (e.g. regular expressions, which PCRE2 supports); in some
cases one would substitute characters; and in some cases one would throw
an error. As strings are immutable in R, the validity check could be
done once per string. This would in many cases allow UTF-8 strings with
bits of invalid content to "go through" R without triggering errors. We
didn't have such options with other multi-byte encodings, but UTF-8 is
designed so that this is possible.
Substituting invalid bytes with various escapes ("<U+..>", "??", etc.)
is something we already do in R in some cases, but it cannot be the
default mechanism for dealing with invalid strings: file names are a
prime example of where it does not work, and indeed there are libraries
for dealing with file names as special entities.
As you probably know, part of the problem is also that the invalid
strings "tolerated" by OSes, file names and environment variables, can
"originally" be in different "encodings". Sorry for all the quotes, but,
simplifying a lot: on Unix, file systems often treat file names as byte
sequences (even though applications often assume they are UTF-8), while
on Windows they are sequences of 16-bit words (sometimes understood as
UCS-2, but not necessarily valid UTF-16). With environment variables, it
goes so far that you can have a narrow and a wide version which may not
directly correspond (I've not investigated the details yet; see the
documentation for _wenviron etc.). In R, particularly on Windows, we can
never get completely rid of UTF-16 and the wide-character interface, so
in some cases we would want to "convert invalid strings" between
encodings, which of course doesn't make sense and would most likely be
the case where invalid strings cause errors. These things are so
complicated that, while it is not my currently preferred solution, I
understand the position that we should ban invalid strings entirely and
keep them banned in R even when UTF-8 is the only supported native
encoding.
Tomas