[R-SIG-Mac] Reading in a table originally with ISO-latin1 encoding (in Linux)
Seppo Nyrkkö
seppo.nyrkko at helsinki.fi
Thu Jun 29 21:19:23 CEST 2006
Dear colleagues,
It seems we have found the solution to this latin-1 problem by taking a
roundabout route. Here is a brief summary of my observations and my
checklist for running R in a latin-1 environment.
I hope some of you will find this helpful. A huge thank-you goes to
Antti for introducing the R project to me with his Intel MacBook!
Things observed
---------------
I tracked down the parts of the source code where printing and reading
input are implemented. The only places where strange effects can arise
are those that call POSIX library functions such as "isprint" and
"iswprint". The behavior of these functions depends on the current
locale at run time: they check whether a character (8-bit or Unicode)
is printable. For instance, the "non-printable" characters in table
headers get replaced by "." characters in the function "do_makenames",
as judged by the is[w]print system function. The string printing
functions are written twice, first for the MBCS environments and then
for the 8-bit environments.
I observed, while reading the source, that the R program binary
supports multi-byte character strings (including utf-8) properly, if
"support-mbcs" is enabled at compile-time. The R binary packages at
CRAN have this feature enabled, so the problem is not in the
compilation.
R decides whether to use MBCS or 8-bit support at the point when the
program is started from the command line. MBCS support is activated in
a UTF-8 environment, e.g. when running in the "fi_FI" locale.
Conversely, 8-bit support is activated when running in the latin-1
locale ("fi_FI.iso8859-1"). The locale check, implemented in the file
platform.c, looks at the maximum byte count of a character. If the
count (MB_CUR_MAX) is greater than 1, MBCS is activated at run time and
8-bit support is disabled; otherwise the reverse holds.
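R makes this decision in C, where MB_CUR_MAX is available. From the
shell, the closest thing to look at is the codeset name printed by
"locale charmap". The following is only a rough sketch and not R's own
logic: the function name r_string_mode is made up for illustration, and
only UTF-8 is treated as multi-byte.

```shell
# Map a codeset name (as printed by `locale charmap`) to the code path
# described above. Assumption: only UTF-8 counts as multi-byte here;
# other multi-byte codesets (for example EUC-JP) are not covered.
r_string_mode() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) echo "MBCS" ;;
        *)                     echo "8-bit" ;;
    esac
}

# Report the codeset of the current environment, if `locale` is available.
if command -v locale >/dev/null 2>&1; then
    codeset=$(locale charmap 2>/dev/null || true)
    echo "codeset: ${codeset:-unknown} -> $(r_string_mode "$codeset")"
fi
```

Running this once with LANG unset and once with LANG set to the intended
locale should show whether the session would be treated as multi-byte.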
Then, by chance, I tried: export LANG="fi_FI" ; bin/R
And I had a working UTF-8 session!
Antti worked out the right latin-1 locale name the next day after lunch:
export LANG="fi_FI.ISO8859-1" ; bin/R
And, behold, the latin-1 characters were available!
Pre-flight checklist
--------------------
The required steps for running R internationally in Mac OS X.
Step 1:
Setting the Mac OS X terminal to the right encoding. Command-I opens
the terminal info window. The "display" tab contains the character
encoding drop-down list. I make my choice between ISO Latin-1 and
UTF-8. In the "emulation" tab there is an option for "escaping
non-ascii characters". I have had trouble with this option in
interactive character-based programs, so I have disabled this.
Step 2:
Enabling the readline library to read international input. The Mac OS X
shell is bare as far as internationalization goes. I chose to write a
".inputrc" file as follows:
$ cat .inputrc
set input-meta on
set output-meta on
set convert-meta off
Thanks go to our unix-aware staff at the department of linguistics.
This allows me to use 8-bit and utf-8 characters at the command prompt,
too! Just echo some non-ascii characters to be sure of this.
$ echo "hähää"
hähää
Seems to be working, so we'll proceed to step 3.
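The echo test shows only what the terminal renders. To see which
encoding is actually being sent, it helps to look at the raw bytes. A
small sketch, assuming the standard "od" and "xargs" tools; the helper
name hex_bytes is made up for illustration:

```shell
# Print the bytes of standard input as two-digit hex values on one line.
hex_bytes() {
    od -An -tx1 | xargs echo
}

# "ä" as a latin-1 terminal sends it: a single byte.
printf '\344' | hex_bytes        # prints: e4

# "ä" as a UTF-8 terminal sends it: two bytes.
printf '\303\244' | hex_bytes    # prints: c3 a4
```

Typing the character at the prompt instead, e.g. echo "ä" | hex_bytes,
shows what the terminal and readline really pass on (followed by 0a for
the newline).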
Step 3:
The default locale in the OS X shell is bare "C", which gives only the
basic 7-bit support without any international extras. (Here bin/R is
the relative pathname to my R binary installation)
When using utf-8 files, I have to run
export LANG=fi_FI ; bin/R
When using latin-1 files, I run
export LANG="fi_FI.ISO8859-1" ; bin/R
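The exact spelling of the locale name matters, so it is worth listing
what is installed before setting LANG. A minimal sketch; the helper name
match_locales is made up for illustration, and the "fi_FI" prefix is
just the one used in this message:

```shell
# Filter a list of locale names (one per line on standard input) down to
# those starting with the given language_territory prefix.
match_locales() {
    grep -i "^$1" || true
}

# Show the installed Finnish locales, if the `locale` tool is available.
if command -v locale >/dev/null 2>&1; then
    locale -a 2>/dev/null | match_locales "fi_FI"
fi
```

One of the names printed can then be used for a single run without
changing the rest of the shell session, in the form LANG=<name> bin/R.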
Testing
-------
Now we have R running in the terminal, supporting latin-1
characters!
> print("ä")
[1] "ä"
> t<-read.table("foo.txt",header=TRUE)
> t
  ööö pöö
1 mäö föö
2 röä fåå
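To repeat this test without depending on an editor's encoding settings,
a latin-1 file can be written byte by byte. This is a sketch under
assumptions: the original foo.txt is not shown in this message, so the
contents below are reconstructed from the output above, and the function
names are made up for illustration.

```shell
# Write a small latin-1 table resembling the one shown above.
# Octal escapes in ISO 8859-1: \366 = ö, \344 = ä, \345 = å.
write_latin1_table() {
    printf '\366\366\366 p\366\366\n'  > "$1"
    printf 'm\344\366 f\366\366\n'    >> "$1"
    printf 'r\366\344 f\345\345\n'    >> "$1"
}

# Convert a latin-1 file to UTF-8 for use in a UTF-8 session.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 < "$1" > "$2"
}

tmpdir=$(mktemp -d)
write_latin1_table "$tmpdir/foo.txt"
latin1_to_utf8 "$tmpdir/foo.txt" "$tmpdir/foo-utf8.txt"
wc -c < "$tmpdir/foo.txt"        # 24 bytes: one byte per character
wc -c < "$tmpdir/foo-utf8.txt"   # 37 bytes: accented letters take two
```

The first file is meant for the latin-1 session and the converted one
for the UTF-8 session.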
To be explored
---------------
- R.app
I haven't yet run the graphical user interface R.app, but its
Objective-C source code suggests that the input given to the R "core"
binary and the output received from the binary are handled as UTF-8.
Whether the R core runs in an MBCS or 8-bit environment or in the "C"
locale is not clear.
- X11 support
I installed the required f77 compiler provided by the "fink" package,
but I didn't yet check whether I can plot latin-1 characters
graphically. I'll need to install the X11 headers and shared libraries
before enabling the X11 support.
with best regards
Seppo Nyrkkö
On Jun 22, 2006 at 19:43, Antti Arppe wrote:
> Dear colleagues,
>
> With the help of a colleague of mine here in Helsinki (Seppo Nyrkkö)
> who looked at the innards of the R source code for Mac it turned out
> that this was in the end indeed an issue concerning the Mac locale and
> its settings and not R.
>
> Though we had tried this earlier by changing the LANG variable to
> 'fi_FI', we hadn't looked hard enough in the available encodings (with
> locale -a) to select the exactly correct value, being:
>
> LANG=fi_FI.ISO8859-1; export LANG;
>
> With this configuration R was able to happily read in my original
> table with the Scandinavian characters in the header, without any fuss.
>
> Thanks for your advice, and wishing all a good Midsummer,
>
> -Antti Arppe
>
> On Mon, 12 Jun 2006 r-sig-mac-request at stat.math.ethz.ch wrote:
>> 1. Reading in a table originally with ISO-latin1 encoding
>> (Linux) (Jörg Beyer)
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Sun, 11 Jun 2006 13:30:35 +0200
>> From: Jörg Beyer <Beyerj at students.uni-marburg.de>
>> Subject: [R-SIG-Mac] Reading in a table originally with ISO-latin1
>> encoding (Linux)
>> To: <r-sig-mac at stat.math.ethz.ch>
>> Message-ID: <C0B1CB7B.1676%Beyerj at students.uni-marburg.de>
>> Content-Type: text/plain; charset="US-ASCII"
>>
>> Antti,
>>
>> I think I can offer some help. I can add the following for
>> R 2.1.1 w/ R.app 1.14
>> Mac OS X 10.4.6, PPC G4/400 (Oct. 1999)
>>
>> If you are only interested in the solution, you can skip the following
>> report and jump to the last paragraph.
>>
>> A tabulated data file with German umlauts in some column headers shows
>> the same behavior as yours, if I use your command
>> data <- read.table(file("<filename>", encoding="<encoding>"),
>> header=TRUE)
>> or these variations
>> data <- read.table(file("<filename>"), header=TRUE)
>> data <- read.table(file("<filename>"), header=FALSE)
>>
>> In all these cases, the same strange behavior results
>> -- regardless of whether the file is encoded as "latin1", "utf-8" or
>> the generic "Mac Roman"
>> -- regardless of whether you choose UTF-8 with or without BOM
>> -- regardless of whether you choose Mac, DOS, or UNIX line endings
>> -- regardless of whether you choose Apple's TextEdit, TextWrangler or
>> BBEdit for setting/changing the encoding (I prefer the latter for its
>> fine tuning, automation, and scripting features)
>> -- regardless of whether you try to read the file with R on the
>> terminal, or with R.app (the Mac GUI)
>> -- strangely enough, R *croaks about "incomplete lines"* even if there
>> are no accented characters (or multibyte characters) in your data file
>> at all, *just plain ASCII*... indicating that the problem may be
>> located deeper in the parsing process, not in the character set.
>>
>> At this point I read (again) the "read.table" help page and found it a
>> bit misleading -- the sep="" option reads as if by default the file is
>> read line by line (1st step), and then every line is split into
>> columns wherever a stream of white space is found (2nd step).
>> I think this is not the case. If you modify your command and
>> explicitly add the separator option (tab, in this case)
>> data <- read.table(file("<filename>", encoding="<encoding>"),
>> sep="\t", header=TRUE)
>>
>> my file reads in without any problems, be it Latin-1 or UTF-8 (not
>> sure how to handle Mac Roman files, at the moment).
>> But keep in mind that multibyte characters are possible, but not
>> recommended in variable names (or column headers).
>>
>> Hope this helps.
>> Cheers
>>
>> Joerg
>> ------------------------------
>> _______________________________________________
> R-SIG-Mac mailing list
> R-SIG-Mac at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/r-sig-mac