[R] Non-ASCII characters in R on Windows

Milan Bouchet-Valat nalimilan at club.fr
Mon Sep 16 16:38:57 CEST 2013


On Monday 16 September 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
> On Friday 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
> > This is a condensed version of the same question on stackexchange here:
> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
> > If you've already stumbled upon it feel free to ignore.
> > 
> > My problem is that R on US Windows does not read *any* text file that
> > contains *any* foreign characters. It simply reads the first n consecutive
> > ASCII characters and then throws a warning once it reaches a foreign
> > character:
> > 
> > > test <- read.table("test.txt", sep=";", dec=",", quote="",
> > fileEncoding="UTF-8")
> > Warning messages:
> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
> > = "UTF-8") :
> >   invalid input found on input connection 'test.txt'
> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
> > = "UTF-8") :
> >   incomplete final line found by readTableHeader on 'test.txt'
> > > print(test)
> >        V1
> > 1 english
> > 
> > > Sys.getlocale()
> >    [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> > States.1252;
> >      LC_MONETARY=English_United
> > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> > 
> > 
> > It is important to note that R on Linux will read UTF-8 as well as
> > exotic character sets without a problem. I've tried it with the exact same
> > files (one was UTF-8 and another was OEM866 Cyrillic).
> > 
> > If I do not include the fileEncoding parameter, read.table will read the
> > whole CSV file. But naturally it will read it wrong because it does not
> > know the encoding. So whenever I try to specify the fileEncoding, R will
> > throw the warnings and stop once it reaches a foreign character. It's the
> > same story with all international character encodings.
> > Other users on stackexchange have reported exactly the same issue.
> > 
> > 
> > Is anyone here who is on a US version of Windows able to import files with
> > foreign characters? Please let me know.
> A reproducible example would have helped, as requested by the posting
> guide.
> 
> That said, I am experiencing the same problem after saving the data
> below to a CSV file encoded in UTF-8 (you can do this even with
> Notepad):
> "Ա","Բ"
> 1,10
> 2,20
> 
> This is on a Windows 7 box using a French locale, but with the same
> codepage 1252 as yours. What is interesting is that reading the file using
> readLines(file("myFile.csv", encoding="UTF-8"))
> gives no invalid characters. So there must be a bug in read.table().
> 
> 
> But I must note that I do not experience issues with French accented
> characters like "é" ("\Ue9"). By contrast, reading Armenian
> characters like "Ա" ("\U531") gives weird results: the character appears
> as <U+0531> instead of Ա.
> 
> Self-contained example, writing the file and reading it back from R:
> tmpfile <- tempfile()
> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
> readLines(file(tmpfile, encoding="UTF-8"))
> # "<U+0531>"
> 
> The same phenomenon happens when creating a data frame from this
> character (as noted on StackExchange):
> data.frame("\U531")
> 
> So my conclusion is that maybe Windows does not really support Unicode
> characters that are not "relevant" for your current locale. And that may
> have created bugs in the way R handles them in read.table(). R
> developers can probably tell us more about it.
After some more investigation, one part of the problem can be traced
back to scan() (with myFile.csv filled as described above):
scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
# Read 2 items
# [1] "Ա" "Բ"

Equivalent, but nonsensical to me:
scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
# Read 2 items
# [1] "Ա" "Բ"

scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
# Read 0 items
# character(0)
# Warning message:
# In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
#  invalid input found on input connection 'myFile.csv'
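
A possible explanation for the difference between the three calls, based on
the documented meaning of the two arguments: encoding= only marks the
strings that were read as UTF-8 and does no conversion, whereas
fileEncoding= asks the connection to re-encode the input into the native
encoding (CP1252 here), which has no representation for Armenian letters.
The sketch below is only an illustration of that distinction. The name
csv_path is made up for the example, and the file is written as raw bytes
so that the result does not depend on the session's locale:

```r
# Hypothetical temporary file standing in for myFile.csv.
csv_path <- tempfile(fileext = ".csv")

# Write the UTF-8 bytes unchanged, whatever the native encoding is.
con <- file(csv_path, "wb")
writeLines(c('"\u0531","\u0532"', "1,10", "2,20"), con, useBytes = TRUE)
close(con)

# encoding=: bytes are read as-is and the strings are declared to be UTF-8.
header <- scan(csv_path, what = "", sep = ",", nlines = 1,
               encoding = "UTF-8", quiet = TRUE)
print(Encoding(header))

# fileEncoding=: the input is re-encoded to the native encoding while
# reading. On the Windows setup described in this thread this call
# returned zero items; elsewhere it may succeed.
reencoded <- suppressWarnings(
  scan(csv_path, what = "", sep = ",", nlines = 1,
       fileEncoding = "UTF-8", quiet = TRUE)
)
print(length(reencoded))
```

If this reading is right, the fileEncoding="CP1252" call above is
"equivalent" only because re-encoding from CP1252 to CP1252 changes nothing.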


So there seems to be one part of the issue in scan(), which for some
reason does not work when passed fileEncoding="UTF-8"; and another part
in read.table(), which transforms "Ա" ("\U531") into "X.U.0531.",
probably via make.names(), since:
make.names("\U531")
# "X.U.0531."
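
If make.names() is indeed what mangles the header, then passing
check.names=FALSE to read.table() should leave the column names alone,
since that argument controls whether make.names() is applied. A minimal
sketch, assuming the same three-line file as above. The name csv_path is
hypothetical, and I have not verified the result on the Windows setup
described in this thread:

```r
# Hypothetical temporary file with the same content as myFile.csv,
# written as raw UTF-8 bytes.
csv_path <- tempfile(fileext = ".csv")
con <- file(csv_path, "wb")
writeLines(c('"\u0531","\u0532"', "1,10", "2,20"), con, useBytes = TRUE)
close(con)

# encoding= marks the strings as UTF-8 without re-encoding them;
# check.names = FALSE skips the make.names() step on the header.
df <- read.table(csv_path, header = TRUE, sep = ",",
                 encoding = "UTF-8", check.names = FALSE)
print(names(df))
print(df)
```

Note that column names that are not syntactically valid have to be quoted
with backticks or accessed with [[ ]] afterwards.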


Does this make sense to R-core members?


Regards


