[Rd] encoding issues even w/o accents (background on single quotes)
Ross Boylan
ross at biostat.ucsf.edu
Fri Jan 19 20:39:49 CET 2007
On Wed, Jan 17, 2007 at 11:56:15PM -0800, Ross Boylan wrote:
> An earlier thread (in 10/2006) discussed encoding issues in the
> context of R data and the desire to represent accented characters.
>
> It matters in another setting: the output generated by R and the
> seemingly ordinary character "'" (single quote). In particular, R CMD
> check runs test code and compares the generated output to a saved file
> of expected output. This does not work reliably across encoding
> schemes. This is unfortunate, since it seems the "expected output"
> files will necessarily be wrong for someone.
>
> The problem for me was triggered by the single-quote character "'".
> On my older systems, this is encoded by 0x27, a perfectly fine ASCII
> character. That is on a Debian GNU/Linux system with LANG=en_US. On
> a newer system I have LANG=en_US.UTF-8. I don't recall whether
> this was a deliberate choice on my part, or simply reflects changing
> defaults for the installer. (Note the earlier thread referred to the
> Debian-derived Ubuntu systems as having switched to UTF-8). Under
> UTF-8 the same character is encoded in the 3-byte sequence 0xE28098
> (which seems odd; I thought the point of UTF-8 was that ASCII was a
> legitimate subset).
Apparently quoting, particularly single quotes, is a can of worms:
http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
When Unicode is available (which would be the case with UTF-8),
particular non-ASCII characters are recommended for single quoting.
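The 3-byte sequence 0xE2 0x80 0x98 is the UTF-8 encoding of U+2018, the
recommended left single quotation mark. So it is a different character
from the ASCII apostrophe, which UTF-8 still encodes as the single byte
0x27; ASCII remains a legitimate subset of UTF-8.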
See http://en.wikipedia.org/wiki/UTF-8 on UTF-8 encoding.
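
To make the byte-level difference concrete, here is a minimal Python
sketch (Python is my choice for illustration, not something from the
original thread). It only encodes the two characters discussed above
and prints their UTF-8 bytes; the variable names are made up.

```python
# Compare the ASCII apostrophe with the Unicode left single quotation mark.
ascii_quote = "\u0027"  # APOSTROPHE, the character typed as '
left_quote = "\u2018"   # LEFT SINGLE QUOTATION MARK

# Encode each character as UTF-8 and keep the raw bytes.
ascii_bytes = ascii_quote.encode("utf-8")
left_bytes = left_quote.encode("utf-8")

print(ascii_bytes.hex())  # 27
print(left_bytes.hex())   # e28098
```

The apostrophe stays a single byte under UTF-8; the three bytes appear
only because a different character is being written.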
This is more than I or, probably, you ever wanted to know about this
issue!
Ross
>
> The coefficient printing methods in the stats package use the
> single-quote in the key explaining significance levels:
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> I suppose one possible work-around for R CMD check would be to set the
> encoding to some standard value before it runs tests, but that has
> some drawbacks. It doesn't work for packages needing a different
> encoding (but perhaps the package could specify an encoding to use by
> default?)(*). It will leave the output files looking weird on systems
> with a different encoding. It will get messed up if one generates the
> files under the wrong encoding.
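
A different work-around from the one proposed above, sketched here
purely as an assumption and not as anything R CMD check actually does,
would be to map directional quotes back to ASCII before comparing the
generated output with the saved file. The helper names
normalize_quotes and outputs_match are hypothetical, and Python is used
only for illustration.

```python
# Hypothetical sketch: treat directional quotes and ASCII quotes as equal
# when comparing expected and actual output. Not part of R CMD check.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Replace directional quotes with their ASCII counterparts."""
    return text.translate(QUOTE_MAP)


def outputs_match(expected: str, actual: str) -> bool:
    """Compare two outputs after quote normalization."""
    return normalize_quotes(expected) == normalize_quotes(actual)


# The significance legend as printed with ASCII quotes and with
# directional quotes.
ascii_legend = "Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
utf8_legend = (
    "Signif. codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 "
    "\u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1"
)

print(ascii_legend == utf8_legend)               # False
print(outputs_match(ascii_legend, utf8_legend))  # True
```

This only covers quote characters; it would not help a package whose
output legitimately contains other non-ASCII text.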
>
> And none of this addresses stuff beyond the context of output file
> comparison in R CMD check.
>
> Any thoughts?
>
> Ross Boylan
>
>
> * From the R Extensions document, discussing the DESCRIPTION file:
> If the `DESCRIPTION' file is not entirely in ASCII it should contain
> an `Encoding' field specifying an encoding. This is currently used as
> the encoding of the `DESCRIPTION' file itself, and may in the future be
> taken as the encoding for other documentation in the package. Only
> encoding names `latin1', `latin2' and `UTF-8' are known to be portable.
>
> I would not expect that the test output files be considered
> "documentation," but I suppose that's subject to interpretation.