[Rd] deparse() and UTF-8 strings

Mon Feb 21 14:17:49 CET 2022

I'm not R-core, but I happen to have run into this issue.

I think this makes sense conceptually, and I have had the same thought 
myself.  One implementation challenge is that the parser has a special 
branch for strings containing Unicode escapes (e.g. "G\u00e1bor") that 
limits such input to 10,000 wide characters, so the parser would need 
to be modified to make this a general solution:

 > parse(text=sprintf('"%s"', strrep("G\\u00e1bor", 2000)))
Error in parse(text = sprintf("\"%s\"", strrep("G\\u00e1bor", 2000))) :
   string at line 1 containing Unicode escapes not in this locale
is too long (max 10000 chars)

Such strings are rare, so maybe an interim solution is to emit \u 
escapes only when deparsing strings short enough to stay under that 
limit.  The parser modification itself would also have the benefit of 
speeding up parsing of strings without Unicode escapes.
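
To make the interim idea concrete, here is a rough R-level sketch.  It is 
not how deparse() is implemented: deparse_u() and its max_chars argument 
are names I made up for illustration, and the 10000 cutoff is only 
inferred from the error message above.  It falls back to plain deparse() 
for ASCII-only input and for strings at or over the cutoff.

```r
# Hypothetical helper (not part of base R): deparse a single string,
# writing non-ASCII characters as \u / \U escapes when it is short enough.
deparse_u <- function(x, max_chars = 10000L) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  cp <- utf8ToInt(enc2utf8(x))  # Unicode code points of x
  # Assumed cutoff: fall back to deparse() for pure ASCII input or for
  # strings the parser's Unicode-escape branch would reject as too long.
  if (all(cp < 128L) || length(cp) >= max_chars) return(deparse(x))
  esc <- vapply(cp, function(p) {
    if (p < 128L) {
      # Let encodeString() escape quotes, backslashes and control
      # characters, then drop the surrounding quotes it adds.
      q <- encodeString(intToUtf8(p), quote = "\"")
      substr(q, 2L, nchar(q) - 1L)
    } else if (p <= 65535L) {
      sprintf("\\u%04x", p)   # code points in the Basic Multilingual Plane
    } else {
      sprintf("\\U%08x", p)   # code points beyond U+FFFF
    }
  }, character(1))
  paste0("\"", paste0(esc, collapse = ""), "\"")
}

# Round-trip check: parse the generated literal and compare with the input.
x <- "G\u00e1bor"
lit <- deparse_u(x)
identical(eval(parse(text = lit)[[1]]), x)
```

Doing this in R one character at a time is slow, and a real change would 
belong in the C deparse code, but it shows the intended output and the 
fallback for long strings.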

Best,

B.

On 2/21/22 5:33 AM, Gábor Csárdi wrote:
> I am wondering if it would make sense to produce \u escaped strings in
> deparse() for UTF-8 input. Currently we have (in R-devel):
> 
> x <- "G\u00e1bor"
> Sys.setlocale("LC_ALL", "C")
> #> [1] "C/C/C/C/C/en_US.UTF-8"
> 
> deparse(x)
> #> [1] "\"G<U+00E1>bor\""
> 
> charToRaw(deparse(x))
> #> [1] 22 47 3c 55 2b 30 30 45 31 3e 62 6f 72 22
> 
> Is there a reason why this is preferable to returning
> 
> "\"G\\u00e1bor\""
> 
> i.e.
> 
> charToRaw("\"G\\u00e1bor\"")
> #>  [1] 22 47 5c 75 30 30 65 31 62 6f 72 22
> 
> Returning the \u escaped form would make deparse() the inverse of
> parse(), at least in this respect.
> 
> Thank you,
> Gabor
> 