[R-SIG-Mac] Text encoding and R

Fri Mar 23 20:03:02 CET 2007

Denis,

to be honest, I'm not quite sure I understand your problem. Therefore  
I'll just explain a bit how to work with encodings (you may also want  
to read Brian Ripley's article in R News).

First, the Mac GUI does everything in UTF-8. All files handled by the  
GUI are expected to be in UTF-8 and will be saved in UTF-8 (there is  
an exception for backwards compatibility - if a document to be read  
doesn't appear to be UTF-8, it is assumed to be MacRoman and will be  
converted to UTF-8 automatically). If you want to read files in the
GUI from other encodings, you have to convert them first.
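
If you want to make a similar check yourself from R, here is a minimal sketch. It only tests whether the bytes of a file form valid UTF-8. The name looks_like_utf8 is made up for this example, it is not what the GUI runs internally, and validUTF8() needs R 3.3.0 or later.

```r
# Hypothetical helper: TRUE if the file's bytes are valid UTF-8.
looks_like_utf8 <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  validUTF8(rawToChar(bytes))
}

# Two small files holding "La B<e-circumflex>te", written byte by byte so
# the result does not depend on the current locale.
utf8_file <- tempfile(fileext = ".txt")
lat1_file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x4c, 0x61, 0x20, 0x42, 0xc3, 0xaa, 0x74, 0x65)), utf8_file)
writeBin(as.raw(c(0x4c, 0x61, 0x20, 0x42, 0xea, 0x74, 0x65)), lat1_file)

utf8_ok <- looks_like_utf8(utf8_file)  # bytes c3 aa are a valid sequence
lat1_ok <- looks_like_utf8(lat1_file)  # byte ea followed by "t" is not
```

A file that fails this check is not UTF-8, but the check cannot tell you which other encoding it is.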

Now, R can read/write files in a variety of encodings, especially on  
the Mac. The fact that the Mac GUI is using UTF-8 makes it possible  
to display text in pretty much any encoding.

Let's say I have this code:
Bête <- "La Bête"
print(Bête)
in three files: txt-utf8.R as UTF-8 (default when you save it with  
the GUI), txt-mac.R (old Mac Roman encoding that is the default if  
you save it in TextEdit for example) and txt-lat1.R in latin1  
(ISO8859-1, default on Windows). They look like this:

caladan:urbanek$ hexdump -C txt-utf8.R
00000000  42 c3 aa 74 65 20 3c 2d  20 22 4c 61 20 42 c3 aa  |B..te <- "La B..|
00000010  74 65 22 0a 70 72 69 6e  74 28 42 c3 aa 74 65 29  |te".print(B..te)|
00000020  0a                                                |.|
00000021
caladan:urbanek$ hexdump -C txt-mac.R
00000000  42 90 74 65 20 3c 2d 20  22 4c 61 20 42 90 74 65  |B.te <- "La B.te|
00000010  22 0a 70 72 69 6e 74 28  42 90 74 65 29 0a        |".print(B.te).|
0000001e
caladan:urbanek$ hexdump -C txt-lat1.R
00000000  42 ea 74 65 20 3c 2d 20  22 4c 61 20 42 ea 74 65  |B.te <- "La B.te|
00000010  22 0a 70 72 69 6e 74 28  42 ea 74 65 29 0a        |".print(B.te).|
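
You can reproduce the bytes for the accented letter from within R. The sketch below does it for UTF-8 and latin1 only. In MacRoman the byte is 90, as in the dump above, but whether iconv() accepts the name "MacRoman" depends on the platform's iconv, so I left it out. The variable names are just for illustration.

```r
# e-circumflex written as a Unicode escape, so this snippet stays ASCII.
e_circ <- "\u00ea"

# UTF-8 uses two bytes for this character.
utf8_bytes <- charToRaw(enc2utf8(e_circ))

# latin1 uses a single byte; toRaw = TRUE returns a list of raw vectors.
lat1_bytes <- iconv(e_circ, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]

print(utf8_bytes)  # the bytes c3 aa
print(lat1_bytes)  # the byte ea
```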

In the Mac GUI you can actually open either txt-utf8.R or txt-mac.R in
the built-in editor and they'll be fine, because the MacRoman version
will be automatically converted to UTF-8. But if you want to source
the code, the encoding matters. Without further information only the
txt-utf8.R file will work, the others won't:

 > source("txt-utf8.R")
[1] "La Bête"
 > source("txt-lat1.R")
Error in source("txt-lat1.R") : invalid multibyte character in mbcs_get_next
 > source("txt-mac.R")
Error in source("txt-mac.R") : invalid multibyte character in mbcs_get_next

You have to tell R which encoding the file really uses:
 > source("txt-lat1.R", encoding="latin1")
[1] "La Bête"
 > source("txt-mac.R", encoding="MacRoman")
[1] "La Bête"
 > ls()
[1] "Bête"
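
Here is a self-contained sketch of the same idea that you can paste into R without having the three files. It writes a one-line latin1 script byte by byte and sources it with the encoding spelled out. The variable x and the temporary file are only for illustration, and I assume a UTF-8 session, which is why the source() call is guarded.

```r
# Write  x <- "La B<e-circumflex>te"  as latin1 bytes (the accent is byte ea).
lat1_file <- tempfile(fileext = ".R")
code_bytes <- c(charToRaw('x <- "La B'), as.raw(0xea), charToRaw('te"\n'))
writeBin(code_bytes, lat1_file)

env <- new.env()
# Only attempt this where the session can represent the accented letter
# (checked here as "is this a UTF-8 locale").
if (isTRUE(l10n_info()[["UTF-8"]])) {
  source(lat1_file, local = env, encoding = "latin1")
  print(env$x == "La B\u00eate")
}
```

Sourcing into a separate environment just keeps the example from touching your workspace; it is not needed for the encoding to work.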

The same applies to any other file operations (read.table,
write.table, ...). R simply has no way of knowing what encoding a
file uses.
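
For data files, a minimal sketch could look like the following. It relies on the fileEncoding argument that read.table() has in current versions of R, again assumes a UTF-8 session, and the column name "nom" and the file contents are invented for the example.

```r
# A two-line latin1 data file: header "nom", one value with byte ea in it.
lat1_data <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("nom\nB"), as.raw(0xea), charToRaw("te\n")), lat1_data)

if (isTRUE(l10n_info()[["UTF-8"]])) {
  # fileEncoding says how the file is encoded; the text is re-encoded
  # into the session's own encoding while it is read.
  d <- read.table(lat1_data, header = TRUE, fileEncoding = "latin1",
                  stringsAsFactors = FALSE)
  print(enc2utf8(d$nom) == "B\u00eate")
}
```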

Finally, if you want to convert files for other users (e.g. for their
convenience), you have several options. Unfortunately the Mac GUI
doesn't have the option to save in a different encoding (yet), but
you can use TextEdit or Xcode to open a file and save it in a
different encoding. Alternatively you can use iconv directly if you
want to convert files in a batch, for example:
iconv -f utf-8 -t latin1 txt-utf8.R > txt-lat1.R
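
If you would rather stay inside R, the same conversion can be sketched with R's own iconv() function. The helper convert_file below is hypothetical (not part of R); it works on raw bytes so that the session locale does not get in the way.

```r
# Hypothetical helper: re-encode one file from one encoding to another.
convert_file <- function(infile, outfile, from = "UTF-8", to = "latin1") {
  bytes <- readBin(infile, what = "raw", n = file.info(infile)$size)
  # iconv() accepts a list of raw vectors; with toRaw = TRUE a failed
  # conversion comes back as NULL.
  out <- iconv(list(bytes), from = from, to = to, toRaw = TRUE)[[1]]
  if (is.null(out)) stop("cannot convert ", infile, " from ", from, " to ", to)
  writeBin(out, outfile)
  invisible(outfile)
}

# Small demonstration on a temporary UTF-8 file containing "B", c3 aa, "te".
utf8_file <- tempfile(fileext = ".R")
lat1_file <- tempfile(fileext = ".R")
writeBin(c(charToRaw("B"), as.raw(c(0xc3, 0xaa)), charToRaw("te\n")), utf8_file)
convert_file(utf8_file, lat1_file)
converted <- readBin(lat1_file, what = "raw", n = 10)
print(converted)  # the bytes 42 ea 74 65 0a
```

For a batch you could loop over list.files() and call the helper once per file.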

One word of caution - if you are *not* using the Mac GUI then R may
not support non-ASCII characters at all, for example:
LANG=C R
 > source("txt-lat1.R")
Error in parse(file, n = -1, NULL, "?") : syntax error at
1: B?
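
To see what kind of session you are in, something along these lines should do. It is only a sketch: the wording of the messages is mine, and treating "neither multibyte nor Latin-1" as the problem case is a simplification.

```r
info <- l10n_info()                 # list with MBCS, `UTF-8`, `Latin-1`
ctype <- Sys.getlocale("LC_CTYPE")  # e.g. "C" or "fr_CA.UTF-8"

if (!isTRUE(info[["MBCS"]]) && !isTRUE(info[["Latin-1"]])) {
  message("LC_CTYPE is '", ctype, "': accented characters may not work here")
} else {
  message("LC_CTYPE is '", ctype, "': non-ASCII text should be usable")
}
```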

I hope this helps ...

Cheers,
Simon

On Mar 23, 2007, at 11:51 AM, Chabot Denis wrote:

> Hi,
>
> For the last many versions of R (at least since v2, I think), I did
> not need to worry about getting R to output french accented vowels on
> my plots. I can place comments in french in my scripts without any
> problem. In fact things work so well I even started, now and then,
> using variable names with accents (so much prettier!).
>
> Now I'm trying to introduce a student to R. Actually she'll need to
> use some of my programs (scripts) to analyse her data.
>
> But all the accented vowels come up wrong on her copy of R (the
> default gui she got after installing R from CRAN) and, worse, her R
> does not like one bit my variable names containing accents.
>
> I know that on my mac, R is not happy if I don't save my programs and
> my data files in UTF8-no BOM (I'm not sure about the no BOM,
> sometimes R accepts files in UTF8 only, sometimes not, but it always
> accepts them if I set the no BOM encoding in TextWrangler).
>
> I've never told R anything about what encoding to use, it chose UTF8
> no BOM all by itself.
>
> I double-checked with these commands:
>> Sys.getlocale(category = "LC_ALL")
> [1] "fr_CA.UTF-8/fr_CA.UTF-8/fr_CA.UTF-8/C/fr_CA.UTF-8/fr_CA.UTF-8"
>> getOption("encoding")
> [1] "native.enc"
>> localeToCharset(locale = Sys.getlocale("LC_CTYPE"))
> [1] "UTF-8"     "ISO8859-1"
>
>
> I thought ISO8859-1 was the same as ISO-Latin 1, and tried again
> reading in a data file containing an accented vowel. R did not like  
> it.
>
> The student on a PC typed the same commands and got:
>
>> Sys.getlocale(category = "LC_ALL")
> [1] "LC_COLLATE=French_Canada.1252;LC_CTYPE=French_Canada.1252;LC_MONETARY=French_Canada.1252;LC_NUMERIC=C;LC_TIME=French_Canada.1252"
>> getOption("encoding")
> [1] "native.enc"
>> localeToCharset(locale = Sys.getlocale("LC_CTYPE"))
> [1] "ISO8859-1"
>
>
> It seems that both our versions of R should handle ISO-Latin 1. Do
> you have enough details to tell me why mine does not seem to like ISO-
> Latin 1?
>
> For the collaboration with my student to work (and for her not to
> give up on R), I need to either make my R accept to give me the same
> level of good services in French using ISO-Latin 1, or to tell my
> student's version of R on a PC to accept scripts and data files in
> UTF8.
>
> Which of the two is easiest?
>
> Thanks in advance,
>
> Denis Chabot