[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF
Ivan Krylov
kry|ov@r00t @end|ng |rom gm@||@com
Tue Jan 31 09:48:20 CET 2023
Can we use the "bytes" encoding for such environment variables invalid
in the current locale? The following patch preserves CE_NATIVE for
strings valid in the current UTF-8 or multibyte locale (or
non-multibyte strings) but sets CE_BYTES for those that are invalid:
Index: src/main/sysutils.c
===================================================================
--- src/main/sysutils.c (revision 83731)
+++ src/main/sysutils.c (working copy)
@@ -393,8 +393,16 @@
char **e;
for (i = 0, e = environ; *e != NULL; i++, e++);
PROTECT(ans = allocVector(STRSXP, i));
- for (i = 0, e = environ; *e != NULL; i++, e++)
- SET_STRING_ELT(ans, i, mkChar(*e));
+ for (i = 0, e = environ; *e != NULL; i++, e++) {
+ cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
+ known_to_be_utf8 ? CE_UTF8 :
+ CE_NATIVE;
+ if (
+ (utf8locale && !utf8Valid(*e))
+ || (mbcslocale && !mbcsValid(*e))
+ ) enc = CE_BYTES;
+ SET_STRING_ELT(ans, i, mkCharCE(*e, enc));
+ }
#endif
} else {
PROTECT(ans = allocVector(STRSXP, i));
@@ -416,11 +424,14 @@
if (s == NULL)
SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0));
else {
- SEXP tmp;
- if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1);
- else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8);
- else tmp = mkChar(s);
- SET_STRING_ELT(ans, j, tmp);
+ cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
+ known_to_be_utf8 ? CE_UTF8 :
+ CE_NATIVE;
+ if (
+ (utf8locale && !utf8Valid(s))
+ || (mbcslocale && !mbcsValid(s))
+ ) enc = CE_BYTES;
+ SET_STRING_ELT(ans, j, mkCharCE(s, enc));
}
#endif
}
Here are the potential problems with this approach:
* I don't know whether known_to_be_utf8 can disagree with utf8locale.
known_to_be_utf8 was the original condition for setting CE_UTF8 on
the string. I also need to detect non-UTF-8 multibyte locales, so
I'm checking for utf8locale and mbcslocale. Perhaps I should be more
careful and test for (enc == CE_UTF8) || (utf8locale && enc ==
CE_NATIVE) instead of just utf8locale.
* I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
strings in the environment with this patch applied, but now
print.Dlist does, because formatDL wants to compute the width of the
string which has the 'bytes' encoding. If this is a good way to
solve the problem, I can work on suggesting a fix for formatDL to
avoid the error.
--
Best regards,
Ivan
More information about the R-devel
mailing list