[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Tue Jan 31 14:37:02 CET 2023

> On 31 Jan 2023, at 12:51 , Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
> 
> 
> On 1/31/23 11:50, Martin Maechler wrote:
<snippage>
>> hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
>> but of course you are right that type stability is really a
>> valid goal....
>> 
>> In general, what about behaving close to "old R" and replace all such
>> strings by  NA_character_  (and typically raising one warning)?
>> This would keep the result a valid character vector, just with some NA entries.
>> 
>> Specifically for  Sys.getenv(),  I still think Simon has a very
>> valid point of "requiring" (of our design) that
>> Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
>> Sys.getenv("BOOM")
>> 
>> Also, as typical R user, I'd definitely want to be able to get all the valid
>> environment variables, even if there are one or more invalid
>> ones. ... and similarly in other cases, it may be a cheap
>> strategy to replace invalid strings ("string" in the sense of
>> length 1 STRSXP, i.e., in R, a "character" of length 1) by
>> NA_character_  and keep all valid parts of the character vector
>> in a valid encoding.
> Specifically in the case of getenv(), yes, we could return NA for variables containing invalid strings, both when obtaining a value for a single variable and for multiple, partially matching undocumented and unintentional behavior R had before 4.1, and making getenv(var) and getenv()[[var]] consistent even with invalid strings.  Once we decide on how to deal with invalid strings in general, we can change this again accordingly, breaking code for people who depend on these things (but so far I only know about this one case). Perhaps this would be a logical alternative to the Python approach that would be more suitable for R (given we have NAs and given that we happened to have that somewhat similar alternative before). Conceptually it is about the same thing as omitting the variable in Python: R users would not be able to use such variables, but if they don't touch them, they could be inherited by child processes, etc.
<more snippage>

Hmm, I'm out of my depth here, but offhand I would be wary of approaches that lead to loss of information. Presumably someone will sooner or later actually want to deal with the content of an environment variable with invalid bytes inside. I.e., it would be preferable to keep the content and mark the encoding as something non-multibyte.

In fact this is almost what happens (for me...) if I just add Encoding(x) <- "bytes" for the return value of .Internal(Sys.getenv(character(), "")):

> Sys.getenv()[["BOOM"]]
[1] "\\xff"
> Encoding(Sys.getenv())
 [1] "unknown" "unknown" "bytes"   "unknown" "unknown" "unknown" "unknown"
 [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
...

but I suppose that breaks if I have environment variables that actually _are_ UTF-8, because only plain ASCII becomes "unknown"? And nchar(Sys.getenv()) does not work either.
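
As a rough sketch only of how the UTF-8 concern might be sidestepped: mark just the elements that fail validation as "bytes" and leave the rest alone. The helper name mark_invalid_as_bytes is made up for illustration and is not part of R, and the sketch assumes that valid entries are meant to be UTF-8 (validUTF8() checks the bytes regardless of the declared encoding). nchar() can then be asked for bytes, or for characters with allowNA = TRUE:

```r
# Hypothetical helper (not in base R): flag only invalid strings as "bytes",
# assuming valid entries are meant to be UTF-8.
mark_invalid_as_bytes <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  Encoding(x[bad]) <- "bytes"
  x
}

# Build the test strings from raw bytes so the example does not depend on
# how the parser treats \x escapes.
vals <- c(PLAIN = "plain",
          CAFE  = rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9))),
          BOOM  = rawToChar(as.raw(0xff)))

marked  <- mark_invalid_as_bytes(vals)
enc     <- Encoding(marked)                               # only BOOM is "bytes"
n_bytes <- nchar(marked, type = "bytes")                  # always computable
n_chars <- nchar(marked, type = "chars", allowNA = TRUE)  # NA for BOOM, no error
```

Whether something like this would be acceptable for Sys.getenv() itself is a separate design question; the sketch only shows that the valid entries need not be touched.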

(And of course I agree that the QRSH thing is Just Wrong; people using 0xff as a separator between UTF-8 strings deserve the same fate as those who used comma separation between numbers with decimal commas.)

-pd

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com