[Rd] encoding issues even w/o accents

Thu Jan 18 08:56:15 CET 2007

An earlier thread (in October 2006) discussed encoding issues in the
context of R data and the desire to represent accented characters.

Encoding also matters in another setting: the output generated by R
and the seemingly ordinary character "'" (single quote).  In
particular, R CMD check runs test code and compares the generated
output to a saved file of expected output.  This does not work
reliably across encoding schemes.  This is unfortunate, since it seems
the "expected output" files will necessarily be wrong for someone.

The problem for me was triggered by the single-quote character "'".
On my older systems, this is encoded by 0x27, a perfectly fine ASCII
character.  That is on a Debian GNU/Linux system with LANG=en_US.  On
a newer system I have LANG=en_US.UTF-8.  I don't recall whether
this was a deliberate choice on my part, or simply reflects changing
defaults for the installer.  (Note the earlier thread referred to the
Debian-derived Ubuntu systems as having switched to UTF-8).  Under
UTF-8 the same character is encoded in the 3-byte sequence 0xE28098
(which seems odd; I thought the point of UTF-8 was that ASCII was a
legitimate subset).
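
A quick check, sketched here in Python rather than R (the helper name
utf8_hex is made up for illustration), suggests that 0xE28098 is not an
encoding of the ASCII apostrophe at all.  It is the UTF-8 encoding of
U+2018, LEFT SINGLE QUOTATION MARK, a different character.  The ASCII
apostrophe is still the single byte 0x27 under UTF-8, so ASCII remains
a legitimate subset; presumably the output itself contains directional
quotes when the locale is UTF-8.

```python
import unicodedata

apostrophe = "'"        # U+0027, the plain ASCII single quote
left_quote = "\u2018"   # U+2018, the directional opening quote


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as an uppercase hex string."""
    return ch.encode("utf-8").hex().upper()


# Print code point, Unicode name, and UTF-8 bytes for each character.
for ch in (apostrophe, left_quote):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: 0x{utf8_hex(ch)}")
```

The two lines printed differ in both code point and byte sequence,
which is why a byte-wise comparison of the output files fails.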

The coefficient printing methods in the stats package use the
single quote in the key explaining significance levels:
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I suppose one possible work-around for R CMD check would be to set the
encoding to some standard value before it runs tests, but that has
some drawbacks.  It doesn't work for packages needing a different
encoding (but perhaps the package could specify an encoding to use by
default?) (*).  It will leave the output files looking weird on systems
with a different encoding.  It will get messed up if one generates the
files under the wrong encoding.
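
A different technique from forcing an encoding would be to normalize
quote characters on both sides before comparing.  The following is
only a sketch in Python of that idea, not what R CMD check does; the
names QUOTE_MAP, normalize_quotes and outputs_match are hypothetical,
and the sample lines assume the UTF-8 run emits directional quotes
(U+2018/U+2019) where the saved file has ASCII ones.

```python
# Hypothetical mapping from directional quotes to their ASCII forms.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single directional quotes
    "\u201c": '"', "\u201d": '"',   # double directional quotes
})


def normalize_quotes(text: str) -> str:
    """Replace directional quotes with plain ASCII quotes."""
    return text.translate(QUOTE_MAP)


def outputs_match(actual: str, expected: str) -> bool:
    """Compare two outputs, ignoring the style of quote characters."""
    return normalize_quotes(actual) == normalize_quotes(expected)


# Saved expected output (ASCII quotes) versus an assumed UTF-8 run.
saved = "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
utf8_run = ("Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 "
            "\u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1")

print(saved == utf8_run)               # False: exact comparison fails
print(outputs_match(utf8_run, saved))  # True: equal after normalizing
```

This sidesteps the question of which encoding the saved files use, at
the cost of hiding genuine differences in quoting, and it does nothing
for packages whose output legitimately contains non-ASCII text.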

And none of this addresses stuff beyond the context of output file
comparison in R CMD check.

Any thoughts?

Ross Boylan

* From the R Extensions document, discussing the DESCRIPTION file:
   If the `DESCRIPTION' file is not entirely in ASCII it should contain
an `Encoding' field specifying an encoding.  This is currently used as
the encoding of the `DESCRIPTION' file itself, and may in the future be
taken as the encoding for other documentation in the package.  Only
encoding names `latin1', `latin2' and `UTF-8' are known to be portable.

I would not expect the test output files to be considered
"documentation," but I suppose that's subject to interpretation.