[Rd] deparse() and UTF-8 strings

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Feb 24 07:12:16 CET 2022


On 22/02/2022 09:53, Gábor Csárdi wrote:
> I just accidentally saw a commit that adds iconv() support for the C99
> \u escapes, which might or might not be accidental:
> https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07

Calling my work 'accidental' felt very rude.  It is work in progress, 
not least as the test suite is unfinished.  Part of that test is to 
ensure it does the same thing as GNU libiconv, where the name and idea 
came from.

> In any case, this is great, and it is very useful to have cross-platform
> support for it. Thank you!
> 
> Would it make sense to generate braced 4-digit \uxxxx sequences, to
> make sure that they don't mix with the surrounding text?
> I.e. \u{xxxx}? (Plus update the 6 to 8 twice.)
> https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07#diff-9a906ea3803721bf2aa8b802e98786c3b096727d87f1c423826e3bba4c112d76R746-R747

But that is not what C99 defines, is it?  (In §6.4.3.)  As \u is always 
followed by 4 hex digits and \U by 8, there is no ambiguity.
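
To illustrate the fixed-width point, here is a minimal sketch in R. The helper escape_c99() is hypothetical: it is not part of R and does not reproduce the iconv() code in the commit above. It writes code points up to U+FFFF as \u plus exactly 4 hex digits and anything larger as \U plus exactly 8, leaving ASCII untouched.

```r
# Hypothetical helper for illustration only (not an R or iconv API):
# escape non-ASCII code points with fixed-width \uXXXX / \UXXXXXXXX forms.
escape_c99 <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  cps <- utf8ToInt(enc2utf8(x))          # Unicode code points as integers
  pieces <- vapply(cps, function(cp) {
    if (cp < 0x80L) {
      intToUtf8(cp)                      # ASCII passes through unchanged
    } else if (cp <= 0xFFFFL) {
      sprintf("\\u%04x", cp)             # always exactly 4 hex digits
    } else {
      sprintf("\\U%08x", cp)             # always exactly 8 hex digits
    }
  }, character(1))
  paste(pieces, collapse = "")
}

x <- "G\u00e1bor"
esc <- escape_c99(x)
cat(esc, "\n")                           # the escaped text, backslash included

# Parse the escaped text back as a string literal and compare with the input.
back <- eval(parse(text = paste0('"', esc, '"')))
identical(back, x)
```

In "G\u00e1bor" the escape is immediately followed by "b", itself a hex digit, yet the string parses back to the original because the escape has already consumed its 4 digits.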

> 
> Also, it seems that we need a capital \U for the 8-digit sequences here:
> https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07#diff-9a906ea3803721bf2aa8b802e98786c3b096727d87f1c423826e3bba4c112d76R753
> 
> Thank you again,
> Gabor
> 
> On Mon, Feb 21, 2022 at 2:17 PM Brodie Gaslam <brodie.gaslam at yahoo.com> wrote:
>>
>> I'm not R-core, but happen to have run into this issue.
>>
>> I think this makes sense conceptually, and have had the same thought
>> myself.  One implementation challenge is that the parser has a special
>> branch for Unicode escape strings (e.g. "G\u00e1bor") that limits such
>> input to 10K wide characters, so the parser would need to be modified in
>> order to make this a general solution:
>>
>>   > parse(text=sprintf('"%s"', strrep("G\\u00e1bor", 2000)))
>> Error in parse(text = sprintf("\"%s\"", strrep("G\\u00e1bor", 2000))) :
>>     string at line 1 containing Unicode escapes not in this locale
>> is too long (max 10000 chars)
>>
>> Such strings are rare so maybe an interim solution is just to allow it
>> for deparsing of shorter strings.  The parser modification itself would
>> also have the benefit of speeding up parsing of strings without Unicode
>> escapes.
>>
>> Best,
>>
>> B.
>>
>>
>> On 2/21/22 5:33 AM, Gábor Csárdi wrote:
>>> I am wondering if it would make sense to produce \u escaped strings in
>>> deparse() for UTF-8 input. Currently we have (in R-devel):
>>>
>>> x <- "G\u00e1bor"
>>> Sys.setlocale("LC_ALL", "C")
>>> #> [1] "C/C/C/C/C/en_US.UTF-8"
>>>
>>> deparse(x)
>>> #> [1] "\"G<U+00E1>bor\""
>>>
>>> charToRaw(deparse(x))
>>> #> [1] 22 47 3c 55 2b 30 30 45 31 3e 62 6f 72 22
>>>
>>> Is there a reason why this is preferable to returning
>>>
>>> "\"G\\u00e1bor\""
>>>
>>> i.e.
>>>
>>> charToRaw("\"G\\u00e1bor\"")
>>> #>  [1] 22 47 5c 75 30 30 65 31 62 6f 72 22
>>>
>>> Returning the \u escaped form would make deparse() the inverse of
>>> parse(), at least in this respect.
>>>
>>> Thank you,
>>> Gabor
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford


