[R] Can't read table encoded in Unicode (R-2.8.1)
Hilmar Berger
hilmar.berger at gmx.de
Sat Apr 18 22:40:11 CEST 2009
Hi Duncan,
Thanks, this solves my problem.
Regards, Hilmar
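
For the archive, here is a minimal sketch of the fix Duncan describes below, assuming an R version in which read.table() accepts the fileEncoding argument (he reports that it works in 2.9.0). The temporary file and the names path, t1 and t2 are only illustrative. The file is written byte by byte as UTF-16LE with a byte-order mark and CRLF line ends, to mimic the hexdump in the quoted message. The second variant, which passes an already-encoded connection to read.table(), is an assumption on my part as a possible alternative and is not something confirmed in the thread.

```r
# Build a small UTF-16 test file (BOM + UTF-16LE, tab-separated,
# comma as decimal mark) resembling the OpenOffice export in the thread.
path  <- tempfile(fileext = ".csv")
lines <- c("a\tb\tc\td",
           "X\t1,2\t1,3\t1,4",
           "Y\t2,2\t2,3\t2,4",
           "Z\t3,2\t3,3\t3,4")
txt   <- paste0(paste(lines, collapse = "\r\n"), "\r\n")
bytes <- c(as.raw(c(0xff, 0xfe)),                        # byte-order mark
           iconv(txt, "UTF-8", "UTF-16LE", toRaw = TRUE)[[1]])
writeBin(bytes, path)

# Variant 1: let read.table() re-encode the file via fileEncoding.
t1 <- read.table(path, header = TRUE, sep = "\t", dec = ",",
                 fileEncoding = "UTF-16")

# Variant 2 (assumption): hand read.table() a connection that already
# declares the encoding; read.table() opens and closes it itself.
t2 <- read.table(file(path, encoding = "UTF-16"),
                 header = TRUE, sep = "\t", dec = ",")

unlink(path)
```

Note that the plain "encoding" argument of read.table() only marks strings as being in a given encoding and does not re-encode the input, which would explain why changing it had no effect.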
Duncan Murdoch wrote:
> On 18/04/2009 1:18 PM, Hilmar Berger wrote:
>> Hi all,
>>
>> I have problems reading Unicode (UTF-16) coded tables in R 2.8.1
>> under Windows Vista.
>>
>> Imagine the following table:
>>
>> a b c d
>> X 1,2 1,3 1,4
>> Y 2,2 2,3 2,4
>> Z 3,2 3,3 3,4
>>
>> Usually I would use the following code to read the table:
>>
>> t = read.table("test.txt", header=T, sep="\t",dec=",")
>>
>> This works well if I create the table using Notepad (the text will be
>> in UTF-8 or ASCII, then).
>
> I haven't tried 2.8.1 (which is obsolete, since yesterday :-), but in
> 2.9.0 it works fine if I use the fileEncoding argument to read.table.
>
> Duncan Murdoch
>
>
>> However, if I use e.g. OpenOffice scalc to create a spreadsheet
>> holding the same data and save this data as text (using tabs as
>> separators, no quotes and using Unicode encoding) the command above
>> gives this:
>>
>> > t = read.table("test.csv", header=T, sep="\t",dec=",")
>> > t
>> ÿþa
>> 1 NA
>> 2 NA
>> 3 NA
>>
>> I tried changing the "encoding" parameter, but that did not
>> change anything.
>>
>> The file from OpenOffice is in UTF-16, as shown by hexdump:
>> $ hexdump test.csv
>> 0000000 feff 0061 0009 0062 0009 0063 0009 0064
>> 0000010 000d 000a 0058 0009 0031 002c 0032 0009
>> 0000020 0031 002c 0033 0009 0031 002c 0034 000d
>> 0000030 000a 0059 0009 0032 002c 0032 0009 0032
>> 0000040 002c 0033 0009 0032 002c 0034 000d 000a
>> 0000050 005a 0009 0033 002c 0032 0009 0033 002c
>> 0000060 0033 0009 0033 002c 0034 000d 000a
>> 000006e
>>
>> I tried to read the file using file/readLines, which seemed to work
>> after specifying the encoding:
>>
>> > a = file("test.csv",open="r", encoding="UTF-16")
>> > b = readLines(a)
>> > b
>> [1] "a\tb\tc\td" "X\t1,2\t1,3\t1,4" "Y\t2,2\t2,3\t2,4"
>> "Z\t3,2\t3,3\t3,4"
>>
>> Looking at the code of readtable.R in R-2.8.1 and R-2.9.0, it seems
>> that the encoding does not get passed through in the second call to
>> scan() appearing in the code.
>>
>> I'm not sure if this is a bug or if I'm doing something wrong here.
>>
>> Regards,
>> Hilmar
>>
>> ------------------
>> My system and R settings are:
>>
>> > sessionInfo()
>> R version 2.8.1 (2008-12-22)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252
>>
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>> loaded via a namespace (and not attached):
>> [1] tools_2.8.1
>>
>> > Sys.info()
>>       sysname      release                      version     nodename
>>     "Windows"      "Vista" "build 6001, Service Pack 1"         "PC"
>>       machine        login         user
>>         "x86"
>> > options("encoding")
>> $encoding
>> [1] "native.enc"
>>