[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
Tomas Kalibera
tom@@@k@||ber@ @end|ng |rom gm@||@com
Tue Jan 31 10:53:21 CET 2023
On 1/31/23 09:48, Ivan Krylov wrote:
> Can we use the "bytes" encoding for such environment variables invalid
> in the current locale? The following patch preserves CE_NATIVE for
> strings valid in the current UTF-8 or multibyte locale (or
> non-multibyte strings) but sets CE_BYTES for those that are invalid:
>
> Index: src/main/sysutils.c
> ===================================================================
> --- src/main/sysutils.c (revision 83731)
> +++ src/main/sysutils.c (working copy)
> @@ -393,8 +393,16 @@
> char **e;
> for (i = 0, e = environ; *e != NULL; i++, e++);
> PROTECT(ans = allocVector(STRSXP, i));
> - for (i = 0, e = environ; *e != NULL; i++, e++)
> - SET_STRING_ELT(ans, i, mkChar(*e));
> + for (i = 0, e = environ; *e != NULL; i++, e++) {
> + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
> + known_to_be_utf8 ? CE_UTF8 :
> + CE_NATIVE;
> + if (
> + (utf8locale && !utf8Valid(*e))
> + || (mbcslocale && !mbcsValid(*e))
> + ) enc = CE_BYTES;
> + SET_STRING_ELT(ans, i, mkCharCE(*e, enc));
> + }
> #endif
> } else {
> PROTECT(ans = allocVector(STRSXP, i));
> @@ -416,11 +424,14 @@
> if (s == NULL)
> SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0));
> else {
> - SEXP tmp;
> - if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1);
> - else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8);
> - else tmp = mkChar(s);
> - SET_STRING_ELT(ans, j, tmp);
> + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
> + known_to_be_utf8 ? CE_UTF8 :
> + CE_NATIVE;
> + if (
> + (utf8locale && !utf8Valid(s))
> + || (mbcslocale && !mbcsValid(s))
> + ) enc = CE_BYTES;
> + SET_STRING_ELT(ans, j, mkCharCE(s, enc));
> }
> #endif
> }
>
> Here are the potential problems with this approach:
>
> * I don't know whether known_to_be_utf8 can disagree with utf8locale.
> known_to_be_utf8 was the original condition for setting CE_UTF8 on
> the string. I also need to detect non-UTF-8 multibyte locales, so
> I'm checking for utf8locale and mbcslocale. Perhaps I should be more
> careful and test for (enc == CE_UTF8) || (utf8locale && enc ==
> CE_NATIVE) instead of just utf8locale.
>
> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
> strings in the environment with this patch applied, but now
> print.Dlist does, because formatDL wants to compute the width of the
> string which has the 'bytes' encoding. If this is a good way to
> solve the problem, I can work on suggesting a fix for formatDL to
> avoid the error.
Thanks, indeed, type instability is a big problem of the approach "turn
invalid strings to bytes". It is something what is historically being
done in regular expression operations, but it is brittle and not user
friendly: writing code to be agnostic to whether we are dealing with
"bytes" or a regular string is very tedious. Pieces of type instability
come also from that ASCII strings are always flagged "native" (never
"bytes"). Last year I had to revert a big change which broke existing
code by introducing some more of this instability due to better dealing
with invalid strings in regular expressions. I've made some additions to
R-devel allowing to better deal with such instability but it is still a
pain and existing code has to be changed (and made more complicated).
So, I don't think this is the way to go.
Tomas
>
More information about the R-devel
mailing list