[R-SIG-Mac] Reading in a table originally with ISO-latin1 encoding (in Linux)

Thu Jun 22 18:43:12 CEST 2006

Dear colleagues,

With the help of a colleague of mine here in Helsinki (Seppo Nyrkkö), 
who looked at the innards of the R source code for Mac, it turned out 
that this was indeed an issue with the Mac locale settings and not 
with R itself.

Although we had tried earlier to change the LANG variable to 'fi_FI', 
we had not looked closely enough at the available locales (listed with 
locale -a) to pick exactly the right value, which is:

LANG=fi_FI.ISO8859-1;
export LANG;

With this configuration R was able to happily read in my original 
table with the Scandinavian characters in the header, without any fuss.
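
For anyone who wants to verify from within R that the locale setting 
was actually picked up, something like the following should do. It is 
only a sketch: Sys.getlocale() and l10n_info() are base R functions and 
were not part of the steps described above, and the values printed 
depend on the machine.

```r
# Print the character-type locale of the running R session,
# e.g. to confirm that the LANG setting reached R.
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() reports whether R treats the session as
# multibyte, UTF-8, or Latin-1.
info <- l10n_info()
print(info[c("MBCS", "UTF-8", "Latin-1")])
```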

Thanks for your advice, and wishing all a good Midsummer,

         -Antti Arppe

On Mon, 12 Jun 2006 r-sig-mac-request at stat.math.ethz.ch wrote:
>   1.  Reading in a table originally with ISO-latin1 encoding
>      (Linux) (Jörg Beyer)
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 11 Jun 2006 13:30:35 +0200
> From: Jörg Beyer <Beyerj at students.uni-marburg.de>
> Subject: [R-SIG-Mac] Reading in a table originally with ISO-latin1
> 	encoding (Linux)
> To: <r-sig-mac at stat.math.ethz.ch>
> Message-ID: <C0B1CB7B.1676%Beyerj at students.uni-marburg.de>
> Content-Type: text/plain;	charset="US-ASCII"
>
> Antti,
>
> I think I can offer some help. I can add the following for
>   R 2.1.1 w/ R.app 1.14
>   Mac OS X 10.4.6, PPC G4/400 (Oct. 1999)
>
> If you are only interested in the solution, you can skip the following
> report and jump to the last paragraph.
>
> A tabulated data file with German umlauts in some column headers shows the
> same behavior as yours, if I use your command
>   data <- read.table(file("<filename>", encoding="<encoding>"),
>   header=TRUE)
> or these variations
>   data <- read.table(file("<filename>"), header=TRUE)
>   data <- read.table(file("<filename>"), header=FALSE)
>
> In all these cases, the same strange behavior results
> -- regardless of whether the file is encoded as "latin1", "utf-8" or the
> generic "Mac Roman"
> -- regardless of whether you choose UTF-8 with or without BOM
> -- regardless of whether you choose Mac, DOS, or UNIX line endings
> -- regardless of whether you use Apple's TextEdit, TextWrangler or BBEdit
> for setting/changing the encoding (I prefer the last for its fine tuning,
> automation, and scripting features)
> -- regardless of whether you read the file with R on the terminal or
> with R.app (the Mac GUI)
> -- strangely enough, R *croaks about "incomplete lines"* even if there are no
> accented characters (or multibyte characters) in your data file at all,
> *just plain ASCII*... indicating that the problem may be located deeper in
> the parsing process, not in the character set.
>
> At this point I read (again) the "read.table" help page and found it a bit
> misleading -- the sep=""-option reads as if by default the file is read line
> by line (1st step), and then every line is split into columns wherever a
> stream of white space is found (2nd step).
> I think this is not the case. If you modify your command and explicitly add
> the separator option (tab, in this case)
>  data <- read.table(file("<filename>", encoding="<encoding>"), sep="\t",
>  header=TRUE)
>
>  my file reads in without any problems, be it Latin-1 or UTF-8 (not sure
> how to handle Mac Roman files, at the moment).
> Keep in mind, though, that multibyte characters in variable names (or
> column headers) are possible but not recommended.
>
> Hope this helps.
> Cheers
>
> Joerg
> ------------------------------
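
The approach from the quoted message (declare the encoding on the 
connection and give the separator explicitly) can be sketched in R as 
follows. This is an illustration under assumptions, not the original 
data: the temporary file, its contents, and check.names=FALSE are all 
made up for the example. Note also that the connection re-encodes the 
input to the session's native encoding, so the session locale has to be 
able to represent the characters.

```r
# Write a small tab-separated file in Latin-1 with a non-ASCII header.
# "\u00e4" is a-umlaut; iconv() turns it into a single Latin-1 byte.
path <- tempfile(fileext = ".txt")   # hypothetical file, for illustration only
lines <- c("sana\tm\u00e4\u00e4r\u00e4", "talo\t3", "auto\t5")
con <- file(path, open = "wb")
writeLines(iconv(lines, from = "UTF-8", to = "latin1"), con, useBytes = TRUE)
close(con)

# Read it back: declare the encoding on the connection and name the
# separator explicitly instead of relying on the default sep = "".
# check.names = FALSE keeps the header exactly as it is in the file.
dat <- read.table(file(path, encoding = "latin1"), sep = "\t",
                  header = TRUE, check.names = FALSE)
print(dat)

unlink(path)   # remove the temporary file
```

Passing check.names=FALSE is a choice made here to keep the header 
untouched; drop it if syntactically valid R names are preferred.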