[R] Can't read table encoded in Unicode (R-2.8.1)

Duncan Murdoch murdoch at stats.uwo.ca
Sat Apr 18 21:52:09 CEST 2009


On 18/04/2009 1:18 PM, Hilmar Berger wrote:
> Hi all,
> 
> I have problems reading Unicode (UTF-16) coded tables in R 2.8.1 under 
> Windows Vista.
> 
> Imagine the following table:
> 
> a    b    c    d
> X    1,2    1,3    1,4
> Y    2,2    2,3    2,4
> Z    3,2    3,3    3,4
> 
> Usually I would use the following code to read the table:
> 
> t = read.table("test.txt", header=T, sep="\t",dec=",")
> 
> This works well if I create the table using Notepad (the text will be in 
> UTF-8 or ASCII, then).

I haven't tried 2.8.1 (which has been obsolete since yesterday :-), but in
2.9.0 it works fine if I pass the fileEncoding argument to read.table
(e.g. fileEncoding="UTF-16").
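
A minimal sketch of that call, assuming a UTF-16 file with a byte-order mark like the one described below. The temporary file and the name tab are only for illustration, and the fileEncoding argument is the one available in 2.9.0:

```r
# Build a small tab-separated table and store it as UTF-16LE with a BOM,
# roughly what a spreadsheet "Unicode" text export looks like.
# (Assumption: "\n" line ends for brevity; the real export used CR LF.)
txt <- paste(c("a\tb\tc\td",
               "X\t1,2\t1,3\t1,4",
               "Y\t2,2\t2,3\t2,4",
               "Z\t3,2\t3,3\t3,4",
               ""), collapse = "\n")
bytes <- c(as.raw(c(0xff, 0xfe)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]])
path <- tempfile(fileext = ".csv")
writeBin(bytes, path)

# fileEncoding makes read.table re-encode the file while reading it.
tab <- read.table(path, header = TRUE, sep = "\t", dec = ",",
                  fileEncoding = "UTF-16")
str(tab)
```

Note that fileEncoding describes how the file is stored, whereas the encoding argument only marks how strings that have already been read are to be interpreted.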

Duncan Murdoch


> However, if I use e.g. OpenOffice scalc to create a spreadsheet holding
> the same data and save it as text (using tabs as separators, no
> quotes, and Unicode encoding), the command above gives this:
> 
>  > t = read.table("test.csv", header=T, sep="\t",dec=",")
>  > t
>   ÿþa
> 1  NA
> 2  NA
> 3  NA
> 
> I tried varying the "encoding" argument, but that did not change
> anything.
> 
> The file from OpenOffice is in UTF-16, as shown by hexdump:
> $ hexdump test.csv
> 0000000 feff 0061 0009 0062 0009 0063 0009 0064
> 0000010 000d 000a 0058 0009 0031 002c 0032 0009
> 0000020 0031 002c 0033 0009 0031 002c 0034 000d
> 0000030 000a 0059 0009 0032 002c 0032 0009 0032
> 0000040 002c 0033 0009 0032 002c 0034 000d 000a
> 0000050 005a 0009 0033 002c 0032 0009 0033 002c
> 0000060 0033 0009 0033 002c 0034 000d 000a
> 000006e
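
The leading feff is the byte-order mark. hexdump prints 16-bit words in host byte order by default, so on a little-endian machine the file really starts with the bytes ff fe (UTF-16LE), and the "ÿþ" in the column name above is those two bytes displayed in the Windows-1252 locale. A small sketch for checking this from within R; the helper name has_utf16le_bom is made up for illustration:

```r
# Hypothetical helper: TRUE if the file starts with the UTF-16LE BOM (ff fe).
has_utf16le_bom <- function(path) {
  identical(readBin(path, what = "raw", n = 2L), as.raw(c(0xff, 0xfe)))
}

# Demo on a temporary file holding the BOM followed by "a" in UTF-16LE.
demo <- tempfile()
writeBin(as.raw(c(0xff, 0xfe, 0x61, 0x00)), demo)
has_utf16le_bom(demo)

# A plain ASCII file has no BOM.
plain <- tempfile()
writeBin(charToRaw("a\tb\n"), plain)
has_utf16le_bom(plain)
```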
> 
> I tried to read the file using file/readLines, which seemed to work 
> after specifying the encoding:
> 
>  > a = file("test.csv",open="r", encoding="UTF-16")
>  > b = readLines(a)
>  > b
> [1] "a\tb\tc\td"       "X\t1,2\t1,3\t1,4" "Y\t2,2\t2,3\t2,4" "Z\t3,2\t3,3\t3,4"
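
Since the connection decodes the file correctly, one possible workaround is to let the connection do the decoding and hand the decoded lines to read.table through a textConnection. This is only a sketch under that assumption: it uses a temporary UTF-16 file with illustrative data instead of test.csv, and it has not been checked on R 2.8.1.

```r
# Recreate a UTF-16LE file with BOM in a temporary location (illustrative data).
lines <- c("a\tb\tc\td", "X\t1,2\t1,3\t1,4",
           "Y\t2,2\t2,3\t2,4", "Z\t3,2\t3,3\t3,4")
txt <- paste(c(lines, ""), collapse = "\n")
path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xff, 0xfe)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]),
         path)

# Step 1: decode the file through a connection that knows the encoding.
con <- file(path, open = "r", encoding = "UTF-16")
decoded <- readLines(con)
close(con)

# Step 2: parse the already-decoded text as a table.
tc <- textConnection(decoded)
tab2 <- read.table(tc, header = TRUE, sep = "\t", dec = ",")
close(tc)
```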
> 
> Looking at readtable.R in R-2.8.1 and R-2.9.0, it seems that the
> encoding is not passed through to the second call to scan() in that
> code.
> 
> I'm not sure if this is a bug or if I'm doing something wrong here.
> 
> Regards,
> Hilmar
> 
> ------------------
> My system  and R settings are:
> 
>  > sessionInfo()
> R version 2.8.1 (2008-12-22)
> i386-pc-mingw32
> 
> locale:
> LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base    
> 
> loaded via a namespace (and not attached):
> [1] tools_2.8.1
> 
>  > Sys.info()
>                      sysname                      release                      version                     nodename
>                    "Windows"                      "Vista" "build 6001, Service Pack 1"                         "PC"
>                      machine                        login                         user
>                        "x86"
> 
>  > options("encoding")
> $encoding
> [1] "native.enc"
> 



