[R-SIG-Mac] Reading in a table originally with ISO-latin1 encoding (in Linux)

Seppo Nyrkkö seppo.nyrkko at helsinki.fi
Thu Jun 29 21:19:23 CEST 2006


Dear colleagues,

It seems we found the solution to this latin-1 problem by taking a
roundabout route. Here is a brief summary of my observations and my
checklist for running R in a latin-1 environment.

I hope some of you will find this helpful. A huge thank-you goes to  
Antti for introducing the R project to me with his Intel MacBook!



Things observed
---------------

I tracked down the parts of the source code where printing and reading
input are implemented. As far as I can tell, the only places where the
strange effects can arise are those that call POSIX library functions
such as "isprint" and "iswprint". The behaviour of these functions
depends on the locale in effect at run time; they check whether a
character (which may be 8-bit or Unicode) is printable. For instance,
the "non-printable" characters in table headers get replaced by "."
characters in the function "do_makenames", as judged by the is[w]print
system function. The string printing functions are written twice, once
for the MBCS environments and once for the 8-bit environments.
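
The locale dependence of the printability test can be seen outside R
as well. The following is only a rough shell analogy, not R's actual
code: the [:print:] class of "tr" stands in for isprint, and the input
is the latin-1 byte sequence for "pöö". In the "C" locale the byte for
"ö" does not count as printable, so it should come out as ".", much as
happens to the table headers.

```shell
# Rough analogy only: tr's [:print:] class stands in for isprint().
# \366 is the latin-1 byte for "ö"; in the "C" locale it is not printable.
masked=$(printf 'p\366\366\n' | LC_ALL=C tr -c '[:print:]\n' '.')
echo "$masked"
```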

While reading the source I observed that the R program binary supports
multi-byte character strings (including UTF-8) properly if
"support-mbcs" is enabled at compile time. The R binary packages at
CRAN have this feature enabled, so the problem is not in the
compilation.

The R program decides whether to use MBCS or 8-bit support at the point
when I start the program at the command line. MBCS support is
activated in a UTF-8 environment, e.g. when running in the "fi_FI"
locale. Conversely, 8-bit support is activated when running in the
latin-1 locale ("fi_FI.ISO8859-1"). The locale check, implemented in
the file platform.c, looks at the maximum byte count of a character:
if this count (MB_CUR_MAX) is greater than 1, MBCS support is activated
at run time; otherwise the 8-bit support is used.
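
To guess from the shell which of the two modes a session would end up
in, the codeset reported by "locale charmap" can serve as a stand-in
for the MB_CUR_MAX test. This is a sketch under assumptions: the helper
name string_mode is made up here, and only UTF-8 is treated as a
multi-byte codeset, although other multi-byte encodings exist.

```shell
# Hypothetical helper: guess which string-handling mode R would pick,
# using the locale's codeset as a stand-in for the MB_CUR_MAX > 1 test.
# Assumption: only UTF-8 is treated as multi-byte in this sketch.
string_mode() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) echo "MBCS" ;;
        *)                     echo "8-bit" ;;
    esac
}

# `locale charmap` prints the codeset of the current locale.
string_mode "$(locale charmap 2>/dev/null)"
```

Running it once under each LANG setting shows whether the locale name
was actually accepted by the system.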

Then, by chance, I tried: export LANG="fi_FI" ; bin/R
And I had a working UTF-8 session!

Antti worked out the right latin-1 locale name the next day after lunch:
export LANG="fi_FI.ISO8859-1" ; bin/R
And, behold, the latin-1 characters were available!


Pre-flight checklist
--------------------

These are the steps required for running R with international
characters on Mac OS X.

Step 1:

Setting the Mac OS X terminal to the right encoding. Command-I opens  
the terminal info window. The "display" tab contains the character  
encoding drop-down list. I make my choice between ISO Latin-1 and  
UTF-8. In the "emulation" tab there is an option for "escaping  
non-ascii characters". I have had trouble with this option in  
interactive character-based programs, so I have disabled this.


Step 2:

Enabling the readline library to read international input. The Mac OS X
shell is bare as far as internationalization goes. I chose to write a
".inputrc" file as follows:

$ cat .inputrc
set input-meta on
set output-meta on
set convert-meta off

Thanks go to our unix-aware staff at the department of linguistics.  
This allows me to use 8-bit and utf-8 characters at the command prompt,  
too! Just echo some non-ascii characters to be sure of this.

$ echo "hähää"
hähää

Seems to be working, so we'll proceed to step 3.


Step 3:

The default locale in the OS X shell is the bare "C" locale, which
gives only the basic 7-bit support without any international extras.
(Here bin/R is the relative pathname to my R binary installation.)

When using UTF-8 files, I run
export LANG=fi_FI ; bin/R

When using latin-1 files, I run
export LANG="fi_FI.ISO8859-1" ; bin/R
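
Since the exact spelling of the locale name matters, it may help to
check it against the output of "locale -a" before starting R. A small
sketch of such a check follows; the helper name has_locale is made up
for this example, and the locale name is the one assumed for a Finnish
latin-1 setup.

```shell
# Hypothetical helper: succeed only if the exact locale name is installed.
has_locale() {
    locale -a 2>/dev/null | grep -qxF -- "$1"
}

wanted="fi_FI.ISO8859-1"   # assumed name; compare with `locale -a`
if has_locale "$wanted"; then
    export LANG="$wanted"
    echo "LANG set to $LANG; now start bin/R"
else
    echo "locale $wanted not found; similar names:" >&2
    locale -a 2>/dev/null | grep -i '^fi_FI' >&2 || true
fi
```

If the name is not found, the list of similar names printed on stderr
shows the spelling the system expects.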


Testing
-------

Now we have R running in the terminal with support for latin-1
characters!

 > print("ä")
[1] "ä"

 > t<-read.table("foo.txt",header=TRUE)
 > t
   ööö pöö
1 mäö föö
2 röä fåå
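
To repeat this test without depending on an editor's encoding
settings, a latin-1 test file can be written byte by byte from the
shell. This is a minimal sketch; the file name is hypothetical and the
contents simply mirror the table printed above.

```shell
# Write the small table from the example as raw latin-1 bytes.
# Octal escapes: \344 = ä, \345 = å, \366 = ö. The file name is hypothetical.
f="${TMPDIR:-/tmp}/foo-latin1.txt"
{
    printf '\366\366\366 p\366\366\n'
    printf '1 m\344\366 f\366\366\n'
    printf '2 r\366\344 f\345\345\n'
} > "$f"

# Each accented letter is one byte in latin-1, so the file is 28 bytes;
# the same text in UTF-8 would take two bytes per accented letter.
nbytes=$(wc -c < "$f" | tr -d ' ')
echo "$nbytes bytes"

# Dump the header line byte by byte to confirm what was written.
head -n 1 "$f" | od -An -c
```

The resulting file can then be passed to read.table in a session
started with the latin-1 locale.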




To be explored
--------------

- R.app

I haven't yet run the graphical user interface R.app, but its
Objective-C source code suggests that the input given to the R "core"
binary and the output received from the binary are handled as UTF-8.
Whether the R core runs in the MBCS or the 8-bit environment, or in
the "C" locale, is not clear.

- X11 support

I installed the required f77 compiler provided by the "fink" package,  
but I didn't yet check whether I can plot latin-1 characters  
graphically. I'll need to install the X11 headers and shared libraries  
before enabling the X11 support.



with best regards

   Seppo Nyrkkö



on Jun 22, 2006 at 19:43, Antti Arppe wrote:

> Dear colleagues,
>
> With the help of a colleague of mine here in Helsinki (Seppo Nyrkkö)  
> who looked at the innards of the R source code for Mac it turned out  
> that this was in the end indeed an issue concerning the Mac locale and  
> its settings and not R.
>
> Though we had tried this earlier by changing the LANG variable to
> 'fi_FI', we hadn't looked hard enough at the available encodings (with
> locale -a) to select exactly the correct value, which is:
>
> LANG=fi_FI.ISO8859-1; export LANG;
>
> With this configuration R was able to happily read in my original
> table with the Scandinavian characters in the header, without any fuss.
>
> Thanks for your advice, and wishing all a good Midsummer,
>
>         -Antti Arppe
>
> On Mon, 12 Jun 2006 r-sig-mac-request at stat.math.ethz.ch wrote:
>>   1.  Reading in a table originally with ISO-latin1 encoding
>>      (Linux) (Jörg Beyer)
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Sun, 11 Jun 2006 13:30:35 +0200
>> From: Jörg Beyer <Beyerj at students.uni-marburg.de>
>> Subject: [R-SIG-Mac] Reading in a table originally with ISO-latin1
>> 	encoding (Linux)
>> To: <r-sig-mac at stat.math.ethz.ch>
>> Message-ID: <C0B1CB7B.1676%Beyerj at students.uni-marburg.de>
>> Content-Type: text/plain;	charset="US-ASCII"
>>
>> Antti,
>>
>> I think I can offer some help. I can add the following for
>>   R 2.1.1 w/ R.app 1.14
>>   Mac OS X 10.4.6, PPC G4/400 (Oct. 1999)
>>
>> If you are only interested in the solution, you can skip the following
>> report and jump to the last paragraph.
>>
>> A tabulated data file with German umlauts in some column headers
>> shows the same behavior as yours, if I use your command
>>   data <- read.table(file("<filename>", encoding="<encoding>"),
>>                      header=TRUE)
>> or these variations
>>   data <- read.table(file("<filename>"), header=TRUE)
>>   data <- read.table(file("<filename>"), header=FALSE)
>>
>> In all these cases, the same strange behavior results
>> -- regardless of whether the file is encoded as "latin1", "utf-8" or
>> the generic "Mac Roman"
>> -- regardless of whether you choose UTF-8 with or without BOM
>> -- regardless of whether you choose Mac, DOS, or UNIX line endings
>> -- regardless of whether you choose Apple's TextEdit, TextWrangler or
>> BBEdit for setting/changing the encoding (I prefer the latter for its
>> fine tuning, automation, and scripting features)
>> -- regardless of whether you try to read the file with R on the
>> terminal, or with R.app (the Mac GUI)
>> -- strangely enough, R *croaks about "incomplete lines"* even if there
>> are no accented characters (or multibyte characters) in your data
>> file at all, *just plain ASCII*... indicating that the problem may be
>> located deeper in the parsing process, not in the character set.
>>
>> At this point I read (again) the "read.table" help page and found it
>> a bit misleading -- the sep="" option reads as if by default the file
>> is read line by line (1st step), and then every line is split into
>> columns wherever a stream of white space is found (2nd step).
>> I think this is not the case. If you modify your command and
>> explicitly add the separator option (tab, in this case)
>>   data <- read.table(file("<filename>", encoding="<encoding>"),
>>                      sep="\t", header=TRUE)
>>
>> my file reads in without any problems, be it Latin-1 or UTF-8 (not
>> sure how to handle Mac Roman files, at the moment).
>> But keep in mind that multibyte characters are possible, but not
>> recommended, in variable names (or column headers).
>>
>> Hope this helps.
>> Cheers
>>
>> Joerg


