[Rd] deparse() and UTF-8 strings
Brodie Gaslam
Mon Feb 21 14:17:49 CET 2022
I'm not R-core, but I happen to have run into this issue.
I think this makes sense conceptually, and have had the same thought
myself. One implementation challenge is that the parser has a special
branch for strings containing Unicode escapes (e.g. "G\u00e1bor") that
limits such input to 10,000 wide characters, so the parser would need to
be modified to make this a general solution:
> parse(text=sprintf('"%s"', strrep("G\\u00e1bor", 2000)))
Error in parse(text = sprintf("\"%s\"", strrep("G\\u00e1bor", 2000))) :
string at line 1 containing Unicode escapes not in this locale
is too long (max 10000 chars)
Such strings are rare, so maybe an interim solution is to emit \u
escapes only when deparsing strings short enough to stay under that
limit. The parser modification itself would also have the benefit of
speeding up parsing of strings without Unicode escapes.
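To make the interim idea concrete, here is a rough sketch rather than a proposed patch. The helper name deparse_escaped() and the max_chars cutoff are hypothetical: neither exists in base R, and the cutoff is just an assumed conservative value below the 10000 reported in the error message. Strings over the cutoff fall back to plain deparse().

```r
# Hypothetical helper, NOT part of base R: deparse a single string using
# \u / \U escapes for everything outside printable ASCII.
# `max_chars` is an assumed, conservative cutoff below the parser limit.
deparse_escaped <- function(x, max_chars = 9000L) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  x <- enc2utf8(x)

  # Long strings: keep the current behaviour instead of emitting escapes
  # that the parser might refuse to read back.
  if (nchar(x, type = "chars") > max_chars) {
    return(deparse(x))
  }

  cp <- utf8ToInt(x)  # Unicode code points of the string
  esc <- vapply(cp, function(p) {
    if (p == 34L || p == 92L) {
      paste0("\\", intToUtf8(p))   # escape double quote and backslash
    } else if (p >= 32L && p <= 126L) {
      intToUtf8(p)                 # printable ASCII is kept as-is
    } else if (p <= 0xFFFF) {
      sprintf("\\u%04x", p)        # control chars and BMP code points
    } else {
      sprintf("\\U%08x", p)        # code points beyond the BMP
    }
  }, character(1))

  paste0("\"", paste(esc, collapse = ""), "\"")
}

x <- "G\u00e1bor"
d <- deparse_escaped(x)
cat(d, "\n")                           # "G\u00e1bor"
identical(eval(parse(text = d)), x)    # round trip through parse()
```

Control characters are written as \u escapes too, rather than octal, since the parser rejects strings that mix Unicode and octal/hex escapes.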
Best,
B.
On 2/21/22 5:33 AM, Gábor Csárdi wrote:
> I am wondering if it would make sense to produce \u escaped strings in
> deparse() for UTF-8 input. Currently we have (in R-devel):
>
> x <- "G\u00e1bor"
> Sys.setlocale("LC_ALL", "C")
> #> [1] "C/C/C/C/C/en_US.UTF-8"
>
> deparse(x)
> #> [1] "\"G<U+00E1>bor\""
>
> charToRaw(deparse(x))
> #> [1] 22 47 3c 55 2b 30 30 45 31 3e 62 6f 72 22
>
> Is there a reason why this is preferable instead of returning
>
> "\"G\\u00e1bor\""
>
> i.e.
>
> charToRaw("\"G\\u00e1bor\"")
> #> [1] 22 47 5c 75 30 30 65 31 62 6f 72 22
>
> Returning the \u escaped form would make deparse() the inverse of
> parse(), at least in this respect.
>
> Thank you,
> Gabor