[Rd] Character encodings and packages
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sun Jan 27 12:49:29 CET 2008
Since R 2.5.0 it has been possible to declare the encodings of character
strings (at the level of individual elements of a character vector).
As a reminder, here is the announcement in NEWS:

    o   R now attempts to keep track of character strings which are
        known to be in Latin-1 or UTF-8 and print or plot them
        appropriately in other locales.  This is primarily intended
        to make it possible to use data in Western European languages
        in both Latin-1 and UTF-8 locales.  Currently scan(),
        read.table(), readLines(), parse() and source() allow
        encodings to be declared, and console input in suitable
        locales is also recognized.

        New function Encoding() can read or set the declared encodings
        for a character vector.

Whereas R itself is careful to make use of this, I see very little
recognition of it in packages -- which need to be making use of
translateChar() rather than CHAR(): see the 'Writing R Extensions' manual.
(I see it used in only one package, and that mainly in a copy of base R
code.)
This will become more important as time goes by and more ways are
introduced to generate marked data. In particular, in R 2.7.0 under
Windows 'Unicode' data (as used by NT-based versions of Windows, usually
UCS-2 but possibly UTF-16) is translated to UTF-8 and marked as such.
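
To make concrete what translating 'Unicode' data to UTF-8 involves, here is a minimal stand-alone sketch in C that encodes a single UCS-2 code unit as UTF-8. It is an illustration only and not the code R uses; the function name ucs2_to_utf8 is hypothetical, and UTF-16 surrogate pairs are deliberately rejected rather than combined.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS-2 code unit (a code point in the Basic Multilingual
 * Plane) as UTF-8.  Writes 1 to 3 bytes into `out` and returns the count.
 * Surrogate units (0xD800..0xDFFF) only occur in UTF-16 and are
 * rejected here with a return value of 0.
 */
size_t ucs2_to_utf8(uint16_t u, unsigned char out[3])
{
    if (u >= 0xD800 && u <= 0xDFFF)
        return 0;                       /* surrogate: not valid UCS-2 */
    if (u < 0x80) {                     /* ASCII: one byte, unchanged */
        out[0] = (unsigned char) u;
        return 1;
    }
    if (u < 0x800) {                    /* U+0080..U+07FF: two bytes */
        out[0] = (unsigned char) (0xC0 | (u >> 6));
        out[1] = (unsigned char) (0x80 | (u & 0x3F));
        return 2;
    }
    /* U+0800..U+FFFF: three bytes */
    out[0] = (unsigned char) (0xE0 | (u >> 12));
    out[1] = (unsigned char) (0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (u & 0x3F));
    return 3;
}
```

For example, the code unit 0x00E7 (c with cedilla) becomes the two bytes 0xC3 0xA7, so a string that is a single byte per character in Latin-1 is no longer so once marked as UTF-8.
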
In essence, every time you use CHAR() in a .Call/.External call in a package
you should consider if the data can be non-ASCII and if so how you want to
handle it. The choices are
- to replace CHAR() by translateChar() and handle the string in the native
encoding of the current locale. This needs the package to depend on
'R (>= 2.5.0)'.
- to note the declared encoding and handle the string in that encoding.
- to translate the string to UTF-8 and handle it in UTF-8. This will be
easiest to do in R >= 2.7.0 using the function translateCharUTF8().
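
The second and third choices amount to looking at the declared encoding and converting accordingly. The following stand-alone C sketch shows the idea for a string marked as Latin-1 that is to be handled in UTF-8. It does not use the R API: the enc_mark type and the to_utf8 function are hypothetical stand-ins, and it assumes that unmarked (native) strings are ASCII or already UTF-8 and can be copied unchanged. In a package the translation would be left to R itself.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a declared-encoding mark; not the R API. */
typedef enum { ENC_NATIVE, ENC_LATIN1, ENC_UTF8 } enc_mark;

/*
 * Copy `in` to `out` as UTF-8, honouring the declared mark.
 * Returns the number of bytes written (excluding the terminator),
 * or (size_t) -1 if `out` is too small.
 * Assumption: ENC_NATIVE and ENC_UTF8 strings are copied as they are.
 */
size_t to_utf8(const char *in, enc_mark mark, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return (size_t) -1;
    for (const unsigned char *p = (const unsigned char *) in; *p; p++) {
        if (mark == ENC_LATIN1 && *p >= 0x80) {
            /* Latin-1 bytes 0x80..0xFF are U+0080..U+00FF: two UTF-8 bytes */
            if (n + 2 >= outsize)
                return (size_t) -1;
            out[n++] = (char) (0xC0 | (*p >> 6));
            out[n++] = (char) (0x80 | (*p & 0x3F));
        } else {
            if (n + 1 >= outsize)
                return (size_t) -1;
            out[n++] = (char) *p;
        }
    }
    out[n] = '\0';
    return n;
}
```

Called on "gar\xE7on" with the mark ENC_LATIN1, this writes the seven bytes "gar\xC3\xA7on". The point of the sketch is that the bytes returned by CHAR() cannot be interpreted without the mark: the same call with ENC_UTF8 leaves its input untouched.
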
For writers of graphics devices there is a further twist in R >= 2.7.0:
currently text is passed to the graphics device in the native encoding,
but by setting the DevDesc variable hasTextUTF8 to TRUE you can indicate
to the graphics engine the ability to accept text in UTF-8. This is done
in several of the standard devices: for example windows() was already
re-encoding to UCS-2 for plotting, and postscript()/pdf() also re-encode
to the selected single-byte encoding.
Character data passed to .C or .Fortran is automatically re-encoded to the
current locale (for .C, from the encoding specified by ENCODING=,
otherwise from the declared encoding if any).
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595