[Rd] use of UTF-8 \uxxxx escape sequences in function arguments
Thomas Zumbrunn
thomas at zumbrunn.name
Fri Jan 20 00:39:18 CET 2012
On Thursday 19 January 2012, peter dalgaard wrote:
> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
> > plain("Zürich") ## works
> > plain("Z\u00BCrich") ## fails
> > escaped("Zürich") ## fails
> > escaped("Z\u00BCrich") ## works
>
> Using the correct UTF-8 code helps quite a bit:
>
> U+00BC ¼ c2 bc VULGAR FRACTION ONE QUARTER
> U+00FC ü c3 bc LATIN SMALL LETTER U WITH DIAERESIS
Thank you for pointing that out. How embarrassing - I systematically used the
wrong representations. Even worse, I didn't carefully read "Writing R
Extensions", which speaks of "Unicode as \uxxxx escapes" rather than "UTF-8 as
\uxxxx escapes". The escape takes the Unicode code point, not the UTF-8 bytes,
so e.g. looking up the UTF-16 representation (which for these characters is
the same number as the code point) would have done the trick.
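
To make the difference concrete, here is a minimal shell sketch (not part of the original exchange) that prints both representations of "Zürich". It assumes a POSIX printf and od, plus any iconv that knows UTF-16BE; the variable names are made up for illustration.

```shell
# Illustrative only: "Zürich" written with octal escapes so this script stays ASCII.
word='Z\303\274rich'

# UTF-8 bytes: the u with diaeresis is encoded as the two bytes c3 bc.
utf8_hex=$(printf "$word" | od -An -tx1 | tr -d ' \n')

# UTF-16BE code units: the same character is the single unit 00fc, i.e. U+00FC.
utf16_hex=$(printf "$word" | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')

echo "UTF-8:    $utf8_hex"
echo "UTF-16BE: $utf16_hex"
```

The "00fc" in the second line is the value that belongs after \u, whereas the "bc" in the first line is only the trailing UTF-8 byte.
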
I didn't find a recommended method of replacing non-ASCII characters with
Unicode \uxxxx escape sequences and ended up using the Unix command line tool
"iconv". However, the iconv version installed on my GNU/Linux machine
(openSUSE 11.4) seems to be outdated and doesn't support the very useful
"--unicode-subst" option yet. I installed "libiconv" from
http://www.gnu.org/software/libiconv/, and now I can easily replace all
non-ASCII characters in my UTF-8 encoded R files with:
iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
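
Since iconv writes to standard output, the result has to be redirected into a new file, and it may be worth checking afterwards that no non-ASCII bytes are left. The following is only a sketch of such a check: the helper count_non_ascii and the sample file names are hypothetical, and the "escaped" file is written by hand here to stand in for the iconv output.

```shell
# Hypothetical helper: count the bytes in a file that lie outside the ASCII range.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

tmpdir=$(mktemp -d)

# A UTF-8 encoded input line, and the ASCII-only form the conversion should yield.
printf 'city <- "Z\303\274rich"\n' > "$tmpdir/raw.R"
printf '%s\n' 'city <- "Z\u00FCrich"' > "$tmpdir/escaped.R"

raw_count=$(count_non_ascii "$tmpdir/raw.R")
escaped_count=$(count_non_ascii "$tmpdir/escaped.R")
echo "raw: $raw_count non-ASCII bytes, escaped: $escaped_count"

rm -r "$tmpdir"
```

A count of 0 for the converted file indicates that every non-ASCII character has been replaced.
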
Thomas Zumbrunn