[R-SIG-Mac] Reading in a table originally with ISO-latin1 encoding (Linux)
Jörg Beyer
Beyerj at students.uni-marburg.de
Sun Jun 11 13:30:35 CEST 2006
Antti,
I think I can offer some help. I can add the following for
R 2.1.1 w/ R.app 1.14
Mac OS X 10.4.6, PPC G4/400 (Oct. 1999)
If you are only interested in the solution, you can skip the following
report and jump to the last paragraph.
A tabulated data file with German umlauts in some column headers shows the
same behavior as yours, if I use your command
data <- read.table(file("<filename>", encoding="<encoding>"),
header=TRUE)
or these variations
data <- read.table(file("<filename>"), header=TRUE)
data <- read.table(file("<filename>"), header=FALSE)
In all these cases, the same strange behavior results
-- regardless of whether the file is encoded as "latin1", "utf-8" or the
generic "Mac Roman"
-- regardless of whether you choose UTF-8 with or without BOM
-- regardless of whether you choose Mac, DOS, or UNIX line endings
-- regardless of whether you choose Apple's TextEdit, TextWrangler or BBEdit
for setting/changing the encoding (I prefer the latter for its fine tuning,
automation, and scripting features)
-- regardless of whether you try to read the file with R on the terminal, or
with R.app (the Mac GUI)
-- strangely enough, R *complains about "incomplete lines"* even if there are
no accented characters (or multibyte characters) in your data file at all,
*just plain ASCII*... indicating that the problem may be located deeper in
the parsing process, not in the character set.
At this point I read (again) the "read.table" help page and found it a bit
misleading -- the description of the default, sep="", reads as if the file
is first read line by line (1st step), and then every line is split into
columns wherever a run of white space is found (2nd step).
I think this is not the case. If you modify your command and explicitly add
the separator option (tab, in this case)
data <- read.table(file("<filename>", encoding="<encoding>"), sep="\t",
header=TRUE)
my file reads in without any problems, be it Latin-1 or UTF-8 (not sure
how to handle Mac Roman files, at the moment).
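
To illustrate why the separator matters, here is a minimal sketch. The file
contents are made up for the example (plain ASCII, tab-delimited, with a space
inside one header), so this is only one plausible way the default can go
wrong, not necessarily what happened with your file. count.fields() reports
how many fields R sees on each line for a given separator.

```r
# Hypothetical three-line, tab-delimited file; the first header contains a space.
lines <- c("Body height\tWeight", "180\t75", "165\t60")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Fields per line when any run of white space separates fields (the default)
n_ws <- count.fields(tmp, sep = "")

# Fields per line when only a tab separates fields
n_tab <- count.fields(tmp, sep = "\t")

print(n_ws)
print(n_tab)

# With the explicit separator, the header and the data rows agree
data <- read.table(tmp, sep = "\t", header = TRUE, check.names = FALSE)
print(names(data))
```

With sep="" the header line should be counted with one field more than the
data lines, because the space inside "Body height" also acts as a separator.
With sep="\t" every line has the same number of fields.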
Keep in mind, though, that multibyte characters in variable names (or column
headers) are possible but not recommended.
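
If you would rather avoid such names altogether, one option is to skip the
header row in the file and supply ASCII names yourself. The sketch below
assumes a made-up tab-delimited file with umlauts in its header; the
replacement names "groesse" and "laenge" are my own choice for illustration.

```r
# Hypothetical file whose header row contains umlauts, written as UTF-8 bytes.
lines <- enc2utf8(c("Gr\u00f6\u00dfe\tL\u00e4nge", "180\t75", "165\t60"))
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp, useBytes = TRUE)

# Skip the original header line and supply plain ASCII column names instead.
data <- read.table(tmp, sep = "\t", header = FALSE, skip = 1,
                   col.names = c("groesse", "laenge"))
str(data)
```

Only the data rows are parsed this way, so the encoding of the header no
longer matters for the column names.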
Hope this helps.
Cheers
Joerg