[R] Non-ASCII characters in R on Windows

Maxim Linchits mlinchits at gmail.com
Mon Sep 16 20:50:06 CEST 2013


"There is a solution for this problem. Writing a binary file instead
of a text file solves this. All applications handling a UTF-8 file in
Windows are using the same trick."
There is no reason why R should fail to perform this very standard
"trick"; apparently R forgot how it works in 2010.
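
If I had to guess at what that "binary file" trick amounts to on the
write side, it is something like this sketch (untested;
lines_with_foreign_chars is just a placeholder for the character vector
to be written out): open the connection in binary mode so Windows cannot
re-encode anything, and push the bytes out with useBytes=TRUE.

con <- file("out.txt", open = "wb")      # binary mode: no locale re-encoding
writeLines(enc2utf8(lines_with_foreign_chars), con, useBytes = TRUE)
close(con)
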
I just tried the advertised script and it did read the foreign
characters. However, the starting file, which looked like this:

1a; 1b
2a; 2b
3a; 3b
...

turns into this in the output (some lines end up holding two quoted
records, "a;b","a;b", while others hold just one, "a;b"):

"1a; 1b", "2a; 2b",
"3a;3b",
...

So it's not a smooth substitute for a working read.table() function.
Probably time to recode all the strings into integers and move along.
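
Or, since readLines() on a connection with a declared encoding does seem
to work (see Milan's messages quoted below), one could skip read.table()
altogether and parse the fields by hand. A rough sketch, assuming every
line has the same number of ";"-separated fields:

con <- file("test.txt", encoding = "UTF-8")
lines <- readLines(con)                        # reads the foreign characters intact
close(con)
fields <- strsplit(lines, ";", fixed = TRUE)   # split on the separator by hand
test <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)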



Best,
Max

On Mon, Sep 16, 2013 at 7:56 PM, Ista Zahn <istazahn at gmail.com> wrote:
> UTF-8 on Windows is a huge pain; this bites me often. Usually I give
> up and do the analysis on a Linux server. In previous struggles with
> this I've found this blog post enlightening:
> https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/
>
> Best,
> Ista
>
> On Mon, Sep 16, 2013 at 10:38 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
>> On Monday, September 16, 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
>>> On Friday, September 13, 2013 at 23:38 +0400, Maxim Linchits wrote:
>>> > This is a condensed version of the same question on stackexchange here:
>>> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
>>> > If you've already stumbled upon it feel free to ignore.
>>> >
>>> > My problem is that R on US Windows does not read *any* text file that
>>> > contains *any* foreign characters. It simply reads the first n consecutive
>>> > ASCII characters and then throws a warning once it reaches a foreign
>>> > character:
>>> >
>>> > > test <- read.table("test.txt", sep=";", dec=",", quote="",
>>> > fileEncoding="UTF-8")
>>> > Warning messages:
>>> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
>>> > = "UTF-8") :
>>> >   invalid input found on input connection 'test.txt'
>>> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
>>> > = "UTF-8") :
>>> >   incomplete final line found by readTableHeader on 'test.txt'
>>> > > print(test)
>>> >        V1
>>> > 1 english
>>> >
>>> > > Sys.getlocale()
>>> >    [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>> > States.1252;
>>> >      LC_MONETARY=English_United
>>> > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>>> >
>>> >
>>> > It is important to note that R on Linux will read UTF-8 as well as
>>> > exotic character sets without a problem. I've tried it with the exact same
>>> > files (one was UTF-8 and another was OEM866 Cyrillic).
>>> >
>>> > If I do not include the fileEncoding parameter, read.table will read the
>>> > whole CSV file, but naturally it will read it wrong because it does not
>>> > know the encoding. And whenever I do specify fileEncoding, R throws the
>>> > warnings above and stops once it reaches a foreign character. It's the
>>> > same story with all international character encodings.
>>> > Other users on StackExchange have reported exactly the same issue.
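>>> >
>>> > One workaround along those lines, reading without fileEncoding and only
>>> > declaring afterwards what the bytes actually are, might look like the
>>> > following (an untested sketch; it assumes the file really is UTF-8):
>>> >
>>> > test <- read.table("test.txt", sep=";", dec=",", quote="",
>>> >                    stringsAsFactors=FALSE)       # read the raw bytes as-is
>>> > test[] <- lapply(test, function(x) {
>>> >   if (is.character(x)) Encoding(x) <- "UTF-8"    # re-declare them as UTF-8
>>> >   x
>>> > })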
>>> >
>>> >
>>> > Is anyone here who is on a US version of Windows able to import files with
>>> > foreign characters? Please let me know.
>>> A reproducible example would have helped, as requested by the posting
>>> guide.
>>>
>>> That said, I am also experiencing the same problem after saving the data
>>> below to a CSV file encoded in UTF-8 (you can do this even with plain
>>> Notepad):
>>> "Ա","Բ"
>>> 1,10
>>> 2,20
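>>>
>>> The same file can also be written from R itself, assuming a UTF-8
>>> connection behaves the same way as saving from Notepad:
>>> con <- file("myFile.csv", open = "w", encoding = "UTF-8")
>>> writeLines(c('"\U531","\U532"', "1,10", "2,20"), con)
>>> close(con)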
>>>
>>> This is on a Windows 7 box using a French locale, but with the same code
>>> page 1252 as yours. What is interesting is that reading the file using
>>> readLines(file("myFile.csv", encoding="UTF-8"))
>>> gives no invalid characters. So there must be a bug in read.table().
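>>> To make the contrast explicit (same file, same session):
>>> readLines(file("myFile.csv", encoding="UTF-8"))
>>> # no warning, all three lines come back
>>> read.table("myFile.csv", sep=",", fileEncoding="UTF-8")
>>> # Warning: invalid input found on input connection 'myFile.csv'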
>>>
>>>
>>> But I must note I do not experience issues with French accented
>>> characters like "é" ("\Ue9"). In contrast, reading Armenian
>>> characters like "Ա" ("\U531") gives weird results: the character appears
>>> as <U+0531> instead of Ա.
>>>
>>> Self-contained example, writing the file and reading it back from R:
>>> tmpfile <- tempfile()
>>> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
>>> readLines(file(tmpfile, encoding="UTF-8"))
>>> # "<U+0531>"
>>>
>>> The same phenomenon happens when creating a data frame from this
>>> character (as noted on StackExchange):
>>> data.frame("\U531")
>>>
>>> So my conclusion is that maybe Windows does not really support Unicode
>>> characters that are not "relevant" for your current locale. And that may
>>> have created bugs in the way R handles them in read.table(). R
>>> developers can probably tell us more about it.
>> After some more investigation, one part of the problem can be traced
>> back to scan() (with myFile.csv filled as described above):
>> scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
>> # Read 2 items
>> # [1] "Ա" "Բ"
>>
>> Equivalent, but nonsensical to me:
>> scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
>> # Read 2 items
>> # [1] "Ա" "Բ"
>>
>> scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
>> # Read 0 items
>> # character(0)
>> # Warning message:
>> # In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>> #  invalid input found on input connection 'myFile.csv'
>>
>>
>> So there seems to be one part of the issue in scan(), which for some
>> reason does not work when passed fileEncoding="UTF-8"; and another part
>> in read.table(), which transforms "Ա" ("\U531") into "X.U.0531.",
>> probably via make.names(), since:
>> make.names("\U531")
>> # "X.U.0531."
>>
>>
>> Does this make sense to R-core members?
>>
>>
>> Regards
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.


