[Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

Tue Jan 31 10:53:21 CET 2023

On 1/31/23 09:48, Ivan Krylov wrote:
> Can we use the "bytes" encoding for such environment variables invalid
> in the current locale? The following patch preserves CE_NATIVE for
> strings valid in the current UTF-8 or multibyte locale (or
> non-multibyte strings) but sets CE_BYTES for those that are invalid:
>
> Index: src/main/sysutils.c
> ===================================================================
> --- src/main/sysutils.c	(revision 83731)
> +++ src/main/sysutils.c	(working copy)
> @@ -393,8 +393,16 @@
>   	char **e;
>   	for (i = 0, e = environ; *e != NULL; i++, e++);
>   	PROTECT(ans = allocVector(STRSXP, i));
> -	for (i = 0, e = environ; *e != NULL; i++, e++)
> -	    SET_STRING_ELT(ans, i, mkChar(*e));
> +	for (i = 0, e = environ; *e != NULL; i++, e++) {
> +	    cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
> +	        known_to_be_utf8 ? CE_UTF8 :
> +	        CE_NATIVE;
> +	    if (
> +	        (utf8locale && !utf8Valid(*e))
> +	        || (mbcslocale && !mbcsValid(*e))
> +	    ) enc = CE_BYTES;
> +	    SET_STRING_ELT(ans, i, mkCharCE(*e, enc));
> +	}
>   #endif
>       } else {
>   	PROTECT(ans = allocVector(STRSXP, i));
> @@ -416,11 +424,14 @@
>   	    if (s == NULL)
>   		SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0));
>   	    else {
> -		SEXP tmp;
> -		if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1);
> -		else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8);
> -		else tmp = mkChar(s);
> -		SET_STRING_ELT(ans, j, tmp);
> +		cetype_t enc = known_to_be_latin1 ? CE_LATIN1 :
> +		    known_to_be_utf8 ? CE_UTF8 :
> +		    CE_NATIVE;
> +		if (
> +		    (utf8locale && !utf8Valid(s))
> +		    || (mbcslocale && !mbcsValid(s))
> +		) enc = CE_BYTES;
> +		SET_STRING_ELT(ans, j, mkCharCE(s, enc));
>   	    }
>   #endif
>   	}
>
> Here are the potential problems with this approach:
>
>   * I don't know whether known_to_be_utf8 can disagree with utf8locale.
>     known_to_be_utf8 was the original condition for setting CE_UTF8 on
>     the string. I also need to detect non-UTF-8 multibyte locales, so
>     I'm checking for utf8locale and mbcslocale. Perhaps I should be more
>     careful and test for (enc == CE_UTF8) || (utf8locale && enc ==
>     CE_NATIVE) instead of just utf8locale.
>
>   * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
>     strings in the environment with this patch applied, but now
>     print.Dlist does, because formatDL wants to compute the width of the
>     string which has the 'bytes' encoding. If this is a good way to
>     solve the problem, I can work on suggesting a fix for formatDL to
>     avoid the error.

Thanks, indeed, type instability is a big problem of the approach "turn 
invalid strings to bytes". It is something what is historically being 
done in regular expression operations, but it is brittle and not user 
friendly: writing code to be agnostic to whether we are dealing with 
"bytes" or a regular string is very tedious. Pieces of type instability 
come also from that ASCII strings are always flagged "native" (never 
"bytes"). Last year I had to revert a big change which broke existing 
code by introducing some more of this instability due to better dealing 
with invalid strings in regular expressions. I've made some additions to 
R-devel allowing to better deal with such instability but it is still a 
pain and existing code has to be changed (and made more complicated).

So, I don't think this is the way to go.

Tomas

>