[R-SIG-Mac] Text encoding and R
Simon Urbanek
simon.urbanek at r-project.org
Fri Mar 23 20:03:02 CET 2007
Denis,
to be honest, I'm not quite sure I understand your problem. Therefore
I'll just explain a bit how to work with encodings (you may also want
to read Brian Ripley's article in R News).
First, the Mac GUI does everything in UTF-8. All files handled by the
GUI are expected to be in UTF-8 and will be saved in UTF-8 (there is
an exception for backwards compatibility - if a document to be read
doesn't appear to be UTF-8, it is assumed to be MacRoman and will be
converted to UTF-8 automatically). If you want to read files in other
encodings in the GUI, you have to convert them first.
Now, R can read/write files in a variety of encodings, especially on
the Mac. Because the Mac GUI uses UTF-8, it can display text that
originated in pretty much any encoding.
Let's say I have this code:
Bête <- "La Bête"
print(Bête)
in three files: txt-utf8.R as UTF-8 (default when you save it with
the GUI), txt-mac.R (old Mac Roman encoding that is the default if
you save it in TextEdit for example) and txt-lat1.R in latin1
(ISO8859-1, default on Windows). They look like this:
caladan:urbanek$ hexdump -C txt-utf8.R
00000000  42 c3 aa 74 65 20 3c 2d  20 22 4c 61 20 42 c3 aa  |B..te <- "La B..|
00000010  74 65 22 0a 70 72 69 6e  74 28 42 c3 aa 74 65 29  |te".print(B..te)|
00000020  0a                                                |.|
00000021
caladan:urbanek$ hexdump -C txt-mac.R
00000000  42 90 74 65 20 3c 2d 20  22 4c 61 20 42 90 74 65  |B.te <- "La B.te|
00000010  22 0a 70 72 69 6e 74 28  42 90 74 65 29 0a        |".print(B.te).|
0000001e
caladan:urbanek$ hexdump -C txt-lat1.R
00000000  42 ea 74 65 20 3c 2d 20  22 4c 61 20 42 ea 74 65  |B.te <- "La B.te|
00000010  22 0a 70 72 69 6e 74 28  42 ea 74 65 29 0a        |".print(B.te).|
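The same byte patterns can be inspected from within R. The following is only a small sketch and not part of the original session; it writes the accented character as a \u escape so that the snippet itself stays pure ASCII and does not depend on the encoding it is saved in.

```r
# "La B\u00eate" is "La Bête" written with a Unicode escape.
s <- "La B\u00eate"

# Bytes of the UTF-8 form: the accented e takes two bytes (c3 aa).
utf8_bytes <- charToRaw(enc2utf8(s))

# Bytes after conversion to latin1: the accented e is the single byte ea.
lat1_bytes <- iconv(s, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]

print(utf8_bytes)  # 4c 61 20 42 c3 aa 74 65
print(lat1_bytes)  # 4c 61 20 42 ea 74 65
```

MacRoman can be inspected the same way, but the name that iconv expects for it differs between platforms; iconvlist() shows what is available on a given system.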
In the Mac GUI you can actually open either txt-utf8.R or txt-mac.R in
the built-in editor and they'll be fine, because the MacRoman version
will be automatically converted to UTF-8. But if you want to source
the code, the encoding matters: by default only the txt-utf8.R file
will work, the others won't:
> source("txt-utf8.R")
[1] "La Bête"
> source("txt-lat1.R")
Error in source("txt-lat1.R") : invalid multibyte character in mbcs_get_next
> source("txt-mac.R")
Error in source("txt-mac.R") : invalid multibyte character in mbcs_get_next
You have to tell R which encoding the file really uses:
> source("txt-lat1.R",enc="latin1")
[1] "La Bête"
> source("txt-mac.R",enc="MacRoman")
[1] "La Bête"
> ls()
[1] "Bête"
The same applies to any other file operations (read.table,
write.table, ...). R just has no way of knowing what the encoding of
a file is.
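To illustrate the point for plain text input, here is a hedged sketch (not from the original session) that builds a tiny latin1 file byte by byte, reads it once without any declaration and once with the encoding declared. The file is a throwaway temporary one. read.table and write.table have a fileEncoding argument that serves the same purpose for tabular data.

```r
# Write a one-line latin1 file byte by byte, so the example does not depend
# on the encoding of this script (0xea is the accented e in latin1).
f <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x4c, 0x61, 0x20, 0x42, 0xea, 0x74, 0x65, 0x0a)), f)

# Read without declaring anything: R just takes the bytes as they are.
undeclared <- readLines(f)
ok_before <- validUTF8(undeclared)   # FALSE: a lone 0xea is not valid UTF-8

# Declare the encoding on input, then convert to UTF-8 for further work.
declared <- enc2utf8(readLines(f, encoding = "latin1"))
ok_after <- validUTF8(declared)      # TRUE

unlink(f)
```

Here the encoding argument of readLines only marks the strings as latin1; the actual conversion is done by enc2utf8.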
Finally, if you want to convert files for other users (e.g. for their
convenience), you have several options. Unfortunately the Mac GUI
doesn't have the option to save in a different encoding (yet), but
you can use TextEdit or Xcode to open a file and save it in a
different encoding. Alternatively, you can use iconv directly if you
want to convert files in a batch, for example:
iconv -f utf-8 -t latin1 txt-utf8.R > txt-lat1.R
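If you would rather stay inside R, the same batch conversion can be sketched with R's own iconv function. This is an illustration under assumptions, not something from the original session: convert_file is a hypothetical helper name, and the scratch directories and demo.R file exist only for the demonstration.

```r
# Hypothetical helper: re-encode one file by working on its raw bytes.
convert_file <- function(src, dest, from = "UTF-8", to = "latin1") {
  bytes <- readBin(src, what = "raw", n = file.size(src))
  out <- iconv(list(bytes), from = from, to = to, toRaw = TRUE)[[1]]
  if (is.null(out)) {
    stop("cannot convert ", src, " from ", from, " to ", to)
  }
  writeBin(out, dest)
  invisible(dest)
}

# Demo in scratch directories: a UTF-8 file holding  x <- "La B\u00eate"
dir_in <- tempfile("utf8-")
dir_out <- tempfile("lat1-")
dir.create(dir_in)
dir.create(dir_out)
writeBin(charToRaw(enc2utf8("x <- \"La B\u00eate\"\n")),
         file.path(dir_in, "demo.R"))

# Convert every .R file found in dir_in into dir_out.
for (src in list.files(dir_in, pattern = "\\.R$", full.names = TRUE)) {
  convert_file(src, file.path(dir_out, basename(src)))
}

# Read the result back as bytes to check it.
converted <- readBin(file.path(dir_out, "demo.R"), what = "raw", n = 100)
```

Working on raw bytes keeps the conversion independent of the locale R happens to be running in, which is the main reason for not going through readLines here.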
One word of caution - if you are *not* using the Mac GUI then R may
not support non-ASCII characters at all, depending on the locale it
is started in. For example:
LANG=C R
> source("txt-lat1.R")
Error in parse(file, n = -1, NULL, "?") : syntax error at
1: B?
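A script can check what kind of locale it is running in before relying on accented names. The following is a coarse sketch, not part of the original session; locale_handles_accents is a hypothetical helper, and it only recognises UTF-8 and Latin-1 locales, so other 8-bit locales that do handle accents will be reported as FALSE.

```r
# Hypothetical helper: TRUE when the current locale is UTF-8 or Latin-1,
# based on what l10n_info() reports.
locale_handles_accents <- function() {
  info <- l10n_info()
  isTRUE(info[["UTF-8"]]) || isTRUE(info[["Latin-1"]])
}

# Show the character-type locale, e.g. "C" when R was started with LANG=C.
print(Sys.getlocale("LC_CTYPE"))

if (!locale_handles_accents()) {
  message("Locale is neither UTF-8 nor Latin-1; ",
          "scripts with accented characters may fail to parse.")
}
```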
I hope this helps ...
Cheers,
Simon
On Mar 23, 2007, at 11:51 AM, Chabot Denis wrote:
> Hi,
>
> For the last many versions of R (at least since v2, I think), I did
> not need to worry about getting R to output french accented vowels on
> my plots. I can place comments in french in my scripts without any
> problem. In fact things work so well I even started, now and then,
> using variable names with accents (so much prettier!).
>
> Now I'm trying to introduce a student to R. Actually she'll need to
> use some of my programs (scripts) to analyse her data.
>
> But all the accented vowels come up wrong on her copy of R (the
> default GUI she got after installing R from CRAN) and worse, her R
> does not like one bit my variable names containing accents.
>
> I know that on my mac, R is not happy if I don't save my programs and
> my data files in UTF8-no BOM (I'm not sure about the no BOM,
> sometimes R accepts files in UTF8 only, sometimes not, but it always
> accepts them if I set the no BOM encoding in TextWrangler).
>
> I've never told R anything about what encoding to use, it chose UTF8
> no BOM all by itself.
>
> I double-checked with these commands:
> > Sys.getlocale(category = "LC_ALL")
> [1] "fr_CA.UTF-8/fr_CA.UTF-8/fr_CA.UTF-8/C/fr_CA.UTF-8/fr_CA.UTF-8"
> > getOption("encoding")
> [1] "native.enc"
> > localeToCharset(locale = Sys.getlocale("LC_CTYPE"))
> [1] "UTF-8" "ISO8859-1"
>
>
> I thought ISO8859-1 was the same as ISO-Latin 1, and tried again
> reading in a data file containing an accented vowel. R did not like
> it.
>
> The student on a PC typed the same commands and got:
>
> > Sys.getlocale(category = "LC_ALL")
> [1] "LC_COLLATE=French_Canada.1252;LC_CTYPE=French_Canada.1252;LC_MONETARY=French_Canada.1252;LC_NUMERIC=C;LC_TIME=French_Canada.1252"
> > getOption("encoding")
> [1] "native.enc"
> > localeToCharset(locale = Sys.getlocale("LC_CTYPE"))
> [1] "ISO8859-1"
>
> It seems that both our versions of R should handle ISO-Latin 1. Do
> you have enough details to tell me why mine does not seem to like
> ISO-Latin 1?
>
> For the collaboration with my student to work (and for her not to
> give up on R), I need to either make my R give me the same level of
> good service in French using ISO-Latin 1, or tell my student's
> version of R on a PC to accept scripts and data files in UTF8.
>
> Which of the two is easiest?
>
> Thanks in advance,
>
> Denis Chabot