[R] Can't read table encoded in Unicode (R-2.8.1)
Hilmar Berger
hilmar.berger at gmx.de
Sat Apr 18 19:18:06 CEST 2009
Hi all,
I have problems reading Unicode (UTF-16) encoded tables in R 2.8.1 under
Windows Vista.
Imagine the following table:
a b c d
X 1,2 1,3 1,4
Y 2,2 2,3 2,4
Z 3,2 3,3 3,4
Usually I would use the following code to read the table:
t = read.table("test.txt", header=T, sep="\t",dec=",")
This works well if I create the table using Notepad (the text will then
be in UTF-8 or ASCII).
However, if I use e.g. OpenOffice scalc to create a spreadsheet holding
the same data and save it as text (tabs as separators, no quotes,
Unicode encoding), the command above gives this:
> t = read.table("test.csv", header=T, sep="\t",dec=",")
> t
ÿþa
1 NA
2 NA
3 NA
I tried different values for the "encoding" parameter, but that did not
change anything.
The file from OpenOffice is in UTF-16, as shown by hexdump:
$ hexdump test.csv
0000000 feff 0061 0009 0062 0009 0063 0009 0064
0000010 000d 000a 0058 0009 0031 002c 0032 0009
0000020 0031 002c 0033 0009 0031 002c 0034 000d
0000030 000a 0059 0009 0032 002c 0032 0009 0032
0000040 002c 0033 0009 0032 002c 0034 000d 000a
0000050 005a 0009 0033 002c 0032 0009 0033 002c
0000060 0033 0009 0033 002c 0034 000d 000a
000006e
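The byte-order mark can also be checked from within R. A minimal sketch, assuming a stand-in file rebuilt from the first line of the dump above (the temporary file and the variable names are illustrative, not part of the original setup):

```r
# Rebuild the header line as UTF-16LE with a byte-order mark (BOM).
# hexdump prints 16-bit words in host byte order, so "feff" on x86
# corresponds to the bytes ff fe on disk.
txt <- "a\tb\tc\td\r\n"
bytes <- as.raw(c(0xff, 0xfe, rbind(utf8ToInt(txt), 0L)))
tmp <- tempfile(fileext = ".csv")   # hypothetical stand-in for test.csv
writeBin(bytes, tmp)

# Read the first two bytes back and compare them with the UTF-16LE BOM.
bom <- readBin(tmp, what = "raw", n = 2)
is_utf16le <- identical(bom, as.raw(c(0xff, 0xfe)))
print(bom)
print(is_utf16le)
```

In a single-byte locale such as CP1252 the bytes ff fe display as ÿþ, which matches the prefix in the column name shown earlier.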
I tried to read the file using file/readLines, which seemed to work
after specifying the encoding:
> a = file("test.csv",open="r", encoding="UTF-16")
> b = readLines(a)
> b
[1] "a\tb\tc\td" "X\t1,2\t1,3\t1,4" "Y\t2,2\t2,3\t2,4" "Z\t3,2\t3,3\t3,4"
Looking at the code of readtable.R in R-2.8.1 and R-2.9.0, it seems that
the encoding does not get passed through in the second call to scan()
appearing in the code.
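Since the connection-based read does decode the file, one possible workaround is to give read.table() an already re-encoding connection, or to parse the lines returned by readLines(). The following is only a sketch under assumptions: the file is a rebuilt stand-in for the OpenOffice export, the names tmp, tab1 and tab2 are illustrative, and it has not been checked on R 2.8.1 itself.

```r
# Stand-in for the OpenOffice export: the table from the post, written as
# UTF-16LE with a byte-order mark and CRLF line ends, as in the hexdump.
txt <- paste("a\tb\tc\td", "X\t1,2\t1,3\t1,4", "Y\t2,2\t2,3\t2,4",
             "Z\t3,2\t3,3\t3,4", "", sep = "\r\n")
bytes <- as.raw(c(0xff, 0xfe, rbind(utf8ToInt(txt), 0L)))
tmp <- tempfile(fileext = ".csv")   # hypothetical file name
writeBin(bytes, tmp)

# Variant 1: let the connection do the re-encoding and hand it to
# read.table(), which opens and closes an unopened connection itself.
con <- file(tmp, encoding = "UTF-16")
tab1 <- read.table(con, header = TRUE, sep = "\t", dec = ",")

# Variant 2: read the decoded lines first, then parse them from memory.
con <- file(tmp, open = "r", encoding = "UTF-16")
lines <- readLines(con)
close(con)
tc <- textConnection(lines)
tab2 <- read.table(tc, header = TRUE, sep = "\t", dec = ",")
close(tc)

print(tab1)
```

Both variants keep the decoding in the connection layer, so they do not depend on how read.table() forwards its own "encoding" argument. Where the installed R version offers a fileEncoding argument to read.table(), that may be a shorter route to the same result.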
I'm not sure if this is a bug or if I'm doing something wrong here.
Regards,
Hilmar
------------------
My system and R settings are:
> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32
locale:
LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.8.1
> Sys.info()
  sysname   release                      version nodename
"Windows"   "Vista" "build 6001, Service Pack 1"     "PC"
  machine     login                         user
    "x86"
> options("encoding")
$encoding
[1] "native.enc"