[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
Simon Urbanek
@|mon@urb@nek @end|ng |rom R-project@org
Tue Jan 31 06:01:07 CET 2023
Ah, sorry, this is my bad - I was trying on an older version of R and it seemed to work, but it was apparently my fault for not setting the environment variable correctly in that test, so I take the regression comment back (I had the recent encoding changes in mind so). I still think it may be worth thinking about what we want it to do - the inconsistency is still true:
> Sys.getenv("BOOM")
[1] "\xff"
> Sys.getenv()[["BOOM"]]
Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
In addition: Warning message:
In regexpr("=", x, fixed = TRUE) : input string 2 is invalid in this locale
so the rest of my comments stand ;).
Thanks,
Simon
> On Jan 31, 2023, at 2:11 PM, Henrik Bengtsson <henrik.bengtsson using gmail.com> wrote:
>
> Thanks for the quick replies.
>
> For me, the main issue is that the value of an environment variable
> can causes errors in R. Say that there's an environment variable that
> is used as a raw data blob (as Simon mentions) and where that data is
> flavor of the day. Then, this would affect R in a random,
> non-predictable way. Unless it can be argued that environment
> variables must not contain random bytes, I argue that R needs to be
> robust against this. If not, we're effectively asking everyone to use
> tryCatch() whenever calling Sys.getenv(). Even so, they wouldn't not
> be able to query *all* environment variables, or even know which they
> are, because Sys.getenv() gives an error. So, I think the only
> reasonable thing to do is to make Sys.getenv() robust against this.
>
> One alternative is to have Sys.getenv() (1) detect for which
> environment variables this happens, (2) produce an informative warning
> that R cannot represent the value correctly, and (3) either drop it,
> or return an adjusted value (e.g. NA_character_). Something like:
>
>> envs <- Sys.getenv()
> Warning message:
> Environment variable 'BOOM' was dropped, because its
> value cannot be represented as a string in R
>
> As you see from my examples below, R used to drop such environment
> variables in R (< 4.1.0).
>
> Simon, your comment saying it used to work, maybe me look back at
> older versions of R. Although I didn't say so, I used R 4.2.2 in my
> original report. It looks like it was in R 4.1.0 that this started to
> fail. I get:
>
> # R 2.15.0 - R 3.1.0
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']"
> <NA>
> NA
> Warning message:
> In strsplit(.Internal(Sys.getenv(character(), "")), "=", fixed = TRUE) :
> input string 8 is invalid in this locale
>
> # R 3.2.0 - R 4.0.4
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']"
> NA NA
> Warning message:
> In strsplit(.Internal(Sys.getenv(character(), "")), "=", fixed = TRUE) :
> input string 137 is invalid in this locale
>
> # R 4.1.0 - ...
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']"
> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
> Calls: Sys.getenv -> substring
> In addition: Warning message:
> In regexpr("=", x, fixed = TRUE) :
> input string 137 is invalid in this locale
> Execution halted
>
> /Henrik
>
> On Mon, Jan 30, 2023 at 4:27 PM Simon Urbanek
> <simon.urbanek using r-project.org> wrote:
>>
>> Tomas,
>>
>> I think you're not addressing the actual issue which is a clear regression in Sys.getenv() [because it used to work and still works for single env var, but not a list] and the cryptic error due to that regression (caused by changes in R-devel). So in either case, Sys.getenv needs fixing (i.e., this should really go to the bugzilla). Its behavior is currently inconsistent.
>>
>> The quoted Python discussion diverges very quickly into file name discussions, but I think that is not relevant here - environment variables are essentially data "blobs" that have application-specific meaning. They just chose to drop them because they didn't want to make a decision.
>>
>> I don't have a strong opinion on this, but Sys.getenv("FOO") and Sys.getenv()[["FOO"]] should not yield two different results. I would argue that if we want to make specific checks, we should make them conditional - even if the default is to add them. Again, the error is due to the implementation of Sys.getenv() breaking in R-devel, not due to any design decision.
>>
>> Cheers,
>> Simon
>>
>>
>>> On Jan 31, 2023, at 1:04 PM, Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>>>
>>>
>>> On 1/30/23 23:01, Henrik Bengtsson wrote:
>>>> /Hello.
>>>>
>>>> SUMMARY:
>>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
>>>> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
>>>>
>>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
>>>> [1] "\xff"
>>>>
>>>> BACKGROUND:
>>>> I launch R through an Son of Grid Engine (SGE) scheduler, where the R
>>>> process is launched on a compute host via 'qrsh', which part of SGE.
>>>> Without going into details, 'mpirun' is also involved. Regardless, in
>>>> this process, an 'qrsh'-specific environment variable 'QRSH_COMMAND'
>>>> is set automatically. The value of this variable comprise of a string
>>>> with \xff (ASCII 255) injected between the words. This is by design
>>>> of SGE [1]. Here is an example of what this environment variable may
>>>> look like:
>>>>
>>>> QRSH_COMMAND= orted\xff--hnp-topo-sig\xff2N:2S:32L3:128L2:128L1:128C:256H:x86_64\xff-mca\xffess\xff\"env\"\xff-mca\xfforte_ess_jobid\xff\"3473342464\"\xff-mca\xfforte_ess_vpid\xff1\xff-mca\xfforte_ess_num_procs\xff\"3\"\xff-mca\xfforte_hnp_uri\xff\"3473342464.0;tcp://192.168.1.13:50847\"\xff-mca\xffplm\xff\"rsh\"\xff-mca\xfforte_tag_output\xff\"1\"\xff--tree-spawn"
>>>>
>>>> where each \xff is a single byte 255=0xFF=\xFF.
>>>>
>>>>
>>>> ISSUE:
>>>> An environment variable with embedded 0xFF bytes in its value causes
>>>> calls to Sys.getenv() to produce an error when running R in a UTF-8
>>>> locale. Here is a minimal example on Linux:
>>>>
>>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
>>>> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
>>>> Calls: Sys.getenv -> substring
>>>> In addition: Warning message:
>>>> In regexpr("=", x, fixed = TRUE) :
>>>> input string 134 is invalid in this locale
>>>> Execution halted
>>>>
>>>>
>>>> WORKAROUND:
>>>> The workaround is to (1) identify any environment variables with
>>>> invalid UTF-8 symbols, and (2) prune or unset those variables before
>>>> launching R, e.g. in my SGE case, launching R using:
>>>>
>>>> QRSH_COMMAND= Rscript --vanilla -e "Sys.getenv()"
>>>>
>>>> avoid the problem. Having to unset/modify environment variables
>>>> because R doesn't like them, see a bit of an ad-hoc hack to me. Also,
>>>> if you are not aware of this problem, or not a savvy R user, it can be
>>>> quite tricky to track down the above error message, especially if
>>>> Sys.getenv() is called deep down in some package dependency.
>>>>
>>>>
>>>> DISCUSSION/SUGGESTION/ASK:
>>>> My suggestion would be to make Sys.getenv() robust against any type of
>>>> byte values in environment variable strings.
>>>>
>>>> The error occurs in Sys.getenv() from:
>>>>
>>>> x <- .Internal(Sys.getenv(character(), ""))
>>>> m <- regexpr("=", x, fixed = TRUE) ## produces a warning
>>>> n <- substring(x, 1L, m - 1L)
>>>> v <- substring(x, m + 1L) ## produces the error
>>>>
>>>> I know too little about string encodings, so I'm not sure what the
>>>> best approach would be here, but maybe falling back to parsing strings
>>>> that are invalid in the current locale using the C locale would be
>>>> reasonable? Maybe Sys.getenv() should always use the C locale for
>>>> this. It looks like Sys.getenv(name) does this, e.g.
>>>>
>>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
>>>> [1] "\xff"
>>>>
>>>> I'd appreciate any comments and suggestions. I'm happy to file a bug
>>>> report on BugZilla, if this is a bug.
>>> This discussion comes from Python: https://bugs.python.org/issue4006
>>> (it says Python skips such environment variables)
>>>
>>> The problem of invalid strings in environment variables is a similar to the problem of invalid strings in file names. Both variables and file names are something people want to use as strings in programs, scripts, texts, but at the same time these may in theory not be valid strings. Working with (potentially) invalid strings (almost) transparently is much harder than with valid strings; even if R decided to do that, it would be hard to implement and take long and only work for some operations, most will still throw errors. In addition, in practice invalid strings are almost always due to an error, particularly so in file names or environment variables. Such errors are often worth catching (wrong encoding declaration, etc), even though perhaps not always.
>>>
>>> In practice, this instance can only be properly fixed at the source, [1] should not do this. split_command() will run into problems with different software, not just R.
>>>
>>> There should be a way to split the commands in ASCII (using some sort of quoting/escaping). Using \xFF is flawed also simply because it may be present in the commands, if we followed the same logic of that every byte is fine. So the code is buggy even regardless of multi-byte encodings.
>>>
>>> Re difficulty to debug, I think the error message is clear and if packages catch and hide errors, that'd be bad design of such packages, R couldn't really do much about that. This needs to be fixed at [1].
>>>
>>> Tomas
>>>
>>>>
>>>> Henrik
>>>>
>>>> [1] https://github.com/gridengine/gridengine/blob/master/source/clients/qrsh/qrsh_starter.c#L462-L466
>>>>
>>>> ______________________________________________
>>>> R-devel using r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>
More information about the R-devel
mailing list