[R] UTF-16 input and read.delim/scan

peter dalgaard pdalgd at gmail.com
Sat May 19 13:49:55 CEST 2012


On May 18, 2012, at 20:19 , Patrick Callier wrote:

> Hi all,
> 
> I am running 64-bit R 2.15.0 on Windows 7.  I am trying to use read.delim
> to read from a file that has 2-byte Unicode (CJK) characters.
> 
> Here is an example of the data (it is tab-delimited if that gets messed up):
> HITId HITTypeId Title
> 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z 看看句子,写写想法
> 请看以下的句子,再回答问
> 
> So read.delim (code below) doesn't read in correctly.  It reads up until it
> hits the CJK characters and then terminates with a warning:
> Warning messages:
> 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>  invalid input found on input connection 'minimal.txt'
> 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>  incomplete final line found by readTableHeader on 'minimal.txt'
> 
> The "Title" field gets filled with an NA.  I played around with scan() a
> little bit and it can read the file correctly if I pass it a file connection
> created with the correct "encoding" argument. It fails with the same
> warnings if I just pass it the filename and set the fileEncoding
> parameter.
> 
> Here is some test code with the above text in a file "minimal.txt"
> # works
> scan(file("minimal.txt",encoding="UTF-16LE"),what=character(),nlines=2)
> # don't work
> scan("minimal.txt",what=character(),nlines=2)     # output is in wrong encoding
> scan("minimal.txt",what=character(),nlines=2,fileEncoding="UTF-16LE")
> #"invalid input found on input connection"
> read.delim(file("minimal.txt",encoding="UTF-16LE"), sep = "\t",
> header=TRUE)    # ditto
> 
> Is this a bug? Or am I just doing something wrong?  Thanks for any help you
> can provide.

This stuff is highly locale dependent (and locales are OS dependent). As I understand things, the encoding= argument to scan() or read.table() says that the file is in a foreign encoding and that you want to work with the strings in that encoding, whereas fileEncoding= means that you want to convert the input to your current encoding and then work with the converted data. In the first case, you need to get the encoding right; in the second, you additionally need to be in a locale that allows the conversion.
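
A minimal sketch of the encoding= case, assuming a small UTF-8 test file written to a temporary path (the file contents and the names title, path and marked are made up for illustration; this is not the original minimal.txt). With encoding=, the bytes are read unconverted and the resulting strings are only marked as UTF-8, so this part should not depend on the locale:

```r
# Hypothetical test data: four CJK characters given as \u escapes,
# which R stores as a UTF-8 marked string in any locale.
title <- "\u770b\u770b\u53e5\u5b50"
path <- tempfile(fileext = ".txt")

# Write the file as raw UTF-8 bytes so that no locale-dependent
# conversion happens on output.
writeBin(charToRaw(enc2utf8(paste0("HITId\tTitle\nA1\t", title, "\n"))), path)

# encoding= does not convert anything: the bytes are read as-is and
# the resulting strings are declared to be UTF-8.
marked <- read.delim(path, encoding = "UTF-8", stringsAsFactors = FALSE)

# The declared encoding and the underlying bytes can be inspected directly.
print(Encoding(marked$Title))
print(identical(charToRaw(marked$Title), charToRaw(title)))
```

How the marked strings are then displayed (as characters or as <U+xxxx> escapes) still depends on what the current locale can represent.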

For file(), requesting an encoding means asking for conversion, so if that doesn't work, you are out of luck (and you're just confusing the issue anyway). Here are a couple of examples run in a Latin-1 (ISO8859-1) locale; notice that if you can't convert Chinese characters to your current locale, then the <U+1234> style output is the best you can hope for.

Peter-Dalgaards-MacBook-Air:minimal pd$ LC_ALL="da_DK.ISO8859-1" R --vanilla < minimal2.R

R version 2.14.2 (2012-02-29)
....
> read.delim(file("minimal.txt",encoding="UTF-8"), sep = "\t", header=TRUE,encoding="UTF-8")
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'minimal.txt'
> read.delim(file="minimal.txt", encoding="UTF-8")
                           HITId                      HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
                                                                     Title
1 <U+770B><U+770B><U+53E5><U+5B50><U+FF0C><U+5199><U+5199><U+60F3><U+6CD5>
                                                                                          Question
1 <U+8BF7><U+770B><U+4EE5><U+4E0B><U+7684><U+53E5><U+5B50><U+FF0C><U+518D><U+56DE><U+7B54><U+95EE>
> read.delim(file="minimal.txt")
                           HITId                      HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
                                                                                                         Title
1 ?\234\213?\234\213?\217??\220?\214?\206\231?\206\231?\203??\225
                                                                                                                                          Question
1 请?\234\213以?\213?\232\204?\217??\220?\214?\206\215?\233\236?\224?\227?
> read.delim(file="minimal.txt", fileEncoding="UTF-8")
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'minimal.txt'
> 
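
For the conversion case, here is a hedged sketch that checks the locale first and only then asks for conversion on input. It assumes that a UTF-8 session is what makes conversion of CJK text possible, and it builds its own small UTF-16LE file (the names txt, path, can_convert, via_connection and via_argument are illustrative, and the data are not the original minimal.txt):

```r
# Assumption: conversion of CJK input can only succeed if the native
# encoding can represent it, which here is taken to mean a UTF-8 locale.
can_convert <- isTRUE(l10n_info()[["UTF-8"]])

# Build a small UTF-16LE file (no byte-order mark) from known text.
txt <- "HITId\tTitle\nA1\t\u770b\u770b\u53e5\u5b50\n"
path <- tempfile(fileext = ".txt")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

if (can_convert) {
  # Both forms request conversion from UTF-16LE to the native encoding.
  via_connection <- read.delim(file(path, encoding = "UTF-16LE"),
                               stringsAsFactors = FALSE)
  via_argument <- read.delim(path, fileEncoding = "UTF-16LE",
                             stringsAsFactors = FALSE)
  print(identical(via_connection$Title, via_argument$Title))
} else {
  message("The native encoding is not UTF-8; conversion on input is likely to fail.")
}
```

In a locale that cannot represent the characters, the else branch is taken and the conversion is not attempted at all, which avoids the "invalid input found on input connection" warnings shown above.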

 


> 
> --Pat
> 
> -- 
> Patrick Callier
> Georgetown University
> http://www12.georgetown.edu/students/prc23/

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


