[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Tue Jan 31 15:34:39 CET 2023

On 1/31/23 14:37, peter dalgaard wrote:
>
>> On 31 Jan 2023, at 12:51 , Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>>
>>
>> On 1/31/23 11:50, Martin Maechler wrote:
> <snippage>
>>> hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
>>> but of course you are right that type stability is really a
>>> valid goal....
>>>
>>> In general, what about behaving close to "old R" and replace all such
>>> strings by  NA_character_  (and typically raising one warning)?
>>> This would keep the result a valid character vector, just with some NA entries.
>>>
>>> Specifically for  Sys.getenv(),  I still think Simon has a very
>>> valid point of "requiring" (of our design) that
>>> Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
>>> Sys.getenv("BOOM")
>>>
>>> Also, as typical R user, I'd definitely want to be able to get all the valid
>>> environment variables, even if there are one or more invalid
>>> ones. ... and similarly in other cases, it may be a cheap
>>> strategy to replace invalid strings ("string" in the sense of
>>> length 1 STRSXP, i.e., in R, a "character" of length 1) by
>>> NA_character_  and keep all valid parts of the character vector
>>> in a valid encoding.
>> In case of specifically getenv(), yes, we could return NA for variables containing invalid strings, both when obtaining a value for a single variable and for multiple, partially matching undocumented and unintentional behavior R had before 4.1, and making getenv(var) and getenv()[[var]] consistent even with invalid strings.  Once we decide on how to deal with invalid strings in general, we can change this again accordingly, breaking code for people who depend on these things (but so far I only know about this one case). Perhaps this would be a logical alternative to the Python approach that would be more suitable for R (given we have NAs and given that we happened to have that somewhat similar alternative before). Conceptually it is about the same thing as omitting the variable in Python: R users would not be able to use such variables, but if they don't touch them, they could still be inherited by child processes, etc.
> <more snippage>
>
> Hum, I'm out of my waters here, but offhand I would be wary about approaches that lead to loss of information. Presumably someone will sooner or later actually want to deal with the content of an environment variable with invalid bytes inside. I.e. it would be preferable to keep the content and mark the encoding as something not-multibyte.
>
> In fact this is almost what happens (for me...) if I just add Encoding(x) <- "bytes" for the return value of .Internal(Sys.getenv(character(), "")):
>
>> Sys.getenv()[["BOOM"]]
> [1] "\\xff"
>> Encoding(Sys.getenv())
>   [1] "unknown" "unknown" "bytes"   "unknown" "unknown" "unknown" "unknown"
>   [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> ...
>
> but I suppose that breaks if I have environment variables that actually _are_ utf8, because only plain-ASCII becomes "unknown"? And nchar(Sys.getenv()) also does not work.

Yes, that way you would get even valid UTF-8 strings represented as 
"bytes". But that is not the main problem. It would technically be 
possible to keep valid strings in the native encoding (typically UTF-8) 
as "native" and mark only the invalid ones as "bytes", as Ivan also 
suggested.

But the key problem is that it breaks because of that "type 
instability": other string functions later will start failing on 
"bytes", resulting in a mess.
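
A minimal illustration of that instability, assuming R running in a 
UTF-8 locale. The 0xff byte is built with rawToChar() here rather than 
read from a real environment variable, so this is only a sketch of the 
effect, not of any change to Sys.getenv():

```r
# Build a one-byte string holding 0xff and mark it as "bytes".
x <- rawToChar(as.raw(0xff))
Encoding(x) <- "bytes"

# Byte-level queries still work on "bytes" strings ...
n_bytes <- nchar(x, type = "bytes")

# ... but character-level queries signal an error, which is what
# downstream code would run into unexpectedly.
chars_result <- tryCatch(nchar(x, type = "chars"),
                         error = function(e) conditionMessage(e))
```

Code that calls nchar(), substring() and similar functions on the result 
of Sys.getenv() would fail in this way only when such a variable 
happens to be present.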

We could provide a new API (e.g. an argument "useBytes=TRUE") which would 
return all variables as "bytes" (all-ASCII values would be "native" per how 
"bytes" works) and let the users decide whether they want to use iconv() 
to turn some of them into strings, and how to do such conversions (e.g. 
error, warn, substitute NA, substitute something else). That would allow 
working with such variables. It would probably be a "clean" solution, 
at least for POSIX systems, but I doubt anyone would use it. For Windows 
it would still be questionable (due to the two environment profiles in 
two different encodings, which may not match).
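
A sketch of what the user side of such an API could look like, assuming 
a UTF-8 native encoding. The helper decode_env() and its on_invalid 
argument are hypothetical, and the input vector only stands in for what 
a bytes-returning variant of Sys.getenv() might deliver; validity is 
checked with validUTF8() instead of iconv() to keep the example small:

```r
# Hypothetical helper: apply a user-chosen policy to values that are
# not valid UTF-8. Not part of R; names are made up for illustration.
decode_env <- function(values, on_invalid = c("na", "warn", "error")) {
  on_invalid <- match.arg(on_invalid)
  bad <- !validUTF8(values)
  if (any(bad)) {
    msg <- paste("invalid UTF-8 in:",
                 paste(names(values)[bad], collapse = ", "))
    if (on_invalid == "error") stop(msg)
    if (on_invalid == "warn") warning(msg)
    # Both "na" and "warn" replace the invalid values by NA.
    values[bad] <- NA_character_
  }
  values
}

# Simulated environment: one valid value and one containing 0xff.
env <- c(HOME = "/home/user", BOOM = rawToChar(as.raw(0xff)))
res <- decode_env(env, on_invalid = "na")
```

With on_invalid = "na" the valid entries are kept unchanged and only 
BOOM becomes NA_character_, so the result remains an ordinary character 
vector.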

> (And of course I agree that the QRSH thing is Just Wrong; people using 0xff as a separator between utf8 strings deserve the same fate as those who used comma separation between numbers with decimal commas.)

Indeed.

And then I am afraid I have to make my position stronger after 
reading more about what Windows does. Its API clearly implies that variables 
are strings, because it automatically converts them between "wide" and 
"multi-byte". An application can have both of these profiles, in which case 
the C runtime manages both, and they may get out of sync. The documentation 
explicitly warns about the resulting confusion, because some characters 
cannot be converted to some encodings.

Also, the Windows approach of treating environment values as strings is 
"compatible" with the fact that different applications on Windows may, and 
often do, use different native encodings (some use UTF-8, such as R, but 
some use the legacy encoding, e.g. Latin1, or a multi-byte encoding for 
other languages). Imagine that an application running in the legacy encoding 
sets an environment variable to a valid non-ASCII string. And then you 
run R and it tries to read that variable. It works due to encoding 
conversions. When the strings are valid.... (and yes, when the mapping 
is 1-1, but that's another matter)
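
To make the dependence on encoding concrete, here is a small sketch in 
plain R; it does not call any Windows API, and the single byte 0xe9 is 
just an assumed example value ("é" in Latin-1):

```r
# The byte 0xe9 is "é" in Latin-1, but on its own it is not valid UTF-8.
legacy <- rawToChar(as.raw(0xe9))
valid_as_utf8 <- validUTF8(legacy)

# Interpreted as Latin-1 and converted, it becomes a proper UTF-8 string.
converted <- iconv(legacy, from = "latin1", to = "UTF-8")
utf8_bytes <- charToRaw(converted)
```

The same bytes are a valid string under one encoding and invalid under 
another, which is why a conversion layer only makes sense if the values 
are treated as strings in a known encoding.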

So, at least on Windows, environment variables clearly are strings, not 
blobs.

Tomas
>
> -pd
>