[R] Non-ASCII characters in R on Windows

Mon Sep 16 19:39:53 CEST 2013

On 16/09/2013 12:04 PM, Maxim Linchits wrote:
> Here is that old post:
> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html

In that post, you'll see I asked for a sample file.  I never received 
any reply; presumably some spam filter didn't like what Alexander sent 
me, and Nabble doesn't archive any attachment.

Similarly, the Stack Overflow thread contains no sample data.

Could someone who is having this problem please put a small sample 
online for download?  As I told Alexander last time, my experiments with 
files I constructed myself showed no errors.

Duncan Murdoch
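
[Editorial sketch, not part of the original message.] For anyone preparing the small sample requested above, one way to build it is to write the bytes from R itself, so the file's encoding does not depend on an editor or on the session's code page. The file name below is hypothetical, and the content is the two-column Armenian example quoted further down in this thread. This is a minimal sketch under those assumptions, not a confirmed reproduction of the reported problem.

```r
# Hypothetical file name; any writable path would do.
sample_path <- file.path(tempdir(), "sample-utf8.csv")

# The two-column sample from the thread: Armenian letters U+0531 and U+0532
# as the header, followed by two numeric rows.
sample_lines <- c('"\u0531","\u0532"', "1,10", "2,20")

# Open in binary mode and write with useBytes = TRUE so that the UTF-8 bytes
# are written as-is instead of being translated to the native code page.
con <- file(sample_path, open = "wb")
writeLines(enc2utf8(sample_lines), con, useBytes = TRUE)
close(con)

# Inspect the raw bytes: U+0531 should appear as d4 b1 if the file is UTF-8.
sample_bytes <- readBin(sample_path, what = "raw", n = file.size(sample_path))
print(sample_bytes[1:9])
```

Posting the raw bytes together with the file makes it possible to tell an encoding problem in the file apart from a problem in how R reads it.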

>
> A taste: "Again, the issue is that opening this UTF-8 encoded file
> under R 2.13.0 yields an error, but opening it under R 2.12.2 works
> without any issues. (...)"
>
> On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> > On Monday 16 September 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
> >> On Friday 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
> >> > This is a condensed version of the same question on stackexchange here:
> >> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
> >> > If you've already stumbled upon it feel free to ignore.
> >> >
> >> > My problem is that R on US Windows does not read *any* text file that
> >> > contains *any* foreign characters. It simply reads the first consecutive n
> >> > ASCII characters and then throws a warning once it reaches a foreign
> >> > character:
> >> >
> >> > > test <- read.table("test.txt", sep=";", dec=",", quote="",
> >> > fileEncoding="UTF-8")
> >> > Warning messages:
> >> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
> >> > = "UTF-8") :
> >> >   invalid input found on input connection 'test.txt'
> >> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
> >> > = "UTF-8") :
> >> >   incomplete final line found by readTableHeader on 'test.txt'
> >> > > print(test)
> >> >        V1
> >> > 1 english
> >> >
> >> > > Sys.getlocale()
> >> >    [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> >> > States.1252;
> >> >      LC_MONETARY=English_United
> >> > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> >> >
> >> >
> >> > It is important to note that R on Linux will read UTF-8 as well as
> >> > exotic character sets without a problem. I've tried it with the exact same
> >> > files (one was UTF-8 and another was OEM866 Cyrillic).
> >> >
> >> > If I do not include the fileEncoding parameter, read.table will read the
> >> > whole CSV file. But naturally it will read it wrong because it does not
> >> > know the encoding. So whenever I try to specify the fileEncoding, R will
> >> > throw the warnings and stop once it reaches a foreign character. It's the
> >> > same story with all international character encodings.
> >> > Other users on stackexchange have reported exactly the same issue.
> >> >
> >> >
> >> > Is anyone here who is on a US version of Windows able to import files with
> >> > foreign characters? Please let me know.
> >> A reproducible example would have helped, as requested by the posting
> >> guide.
> >>
> >> Though I am also experiencing the same problem after saving the data
> >> below to a CSV file encoded in UTF-8 (you can do this even with
> >> Notepad):
> >> "Ա","Բ"
> >> 1,10
> >> 2,20
> >>
> >> This is on a Windows 7 box using French locale, but same codepage 1252
> >> as yours. What is interesting is that reading the file using
> >> readLines(file("myFile.csv", encoding="UTF-8"))
> >> gives no invalid characters. So there must be a bug in read.table().
> >>
> >>
> >> But I must note I do not experience issues with French accented
> >> characters like "é" ("\Ue9"). On the contrary, reading Armenian
> >> characters like "Ա" ("\U531") gives weird results: the character appears
> >> as <U+0531> instead of Ա.
> >>
> >> Self-contained example, writing the file and reading it back from R:
> >> tmpfile <- tempfile()
> >> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
> >> readLines(file(tmpfile, encoding="UTF-8"))
> >> # "<U+0531>"
> >>
> >> The same phenomenon happens when creating a data frame from this
> >> character (as noted on StackExchange):
> >> data.frame("\U531")
> >>
> >> So my conclusion is that maybe Windows does not really support Unicode
> >> characters that are not "relevant" for your current locale. And that may
> >> have created bugs in the way R handles them in read.table(). R
> >> developers can probably tell us more about it.
> > After some more investigation, one part of the problem can be traced
> > back to scan() (with myFile.csv filled as described above):
> > scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
> > # Read 2 items
> > # [1] "Ա" "Բ"
> >
> > Equivalent, but nonsensical to me:
> > scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
> > # Read 2 items
> > # [1] "Ա" "Բ"
> >
> > scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
> > # Read 0 items
> > # character(0)
> > # Warning message:
> > # In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
> > #  invalid input found on input connection 'myFile.csv'
> >
> >
> > So there seems to be one part of the issue in scan(), which for some
> > reason does not work when passed fileEncoding="UTF-8"; and another part
> > in read.table(), which transforms "Ա" ("\U531") into "X.U.0531.",
> > probably via make.names(), since:
> > make.names("\U531")
> > # "X.U.0531."
> >
> >
> > Does this make sense to R-core members?
> >
> >
> > Regards
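
[Editorial sketch, not part of the original thread.] The two observations quoted above (scan() succeeds with encoding="UTF-8" but fails with fileEncoding="UTF-8", and the header is altered by make.names()) suggest one combination to try: pass encoding= rather than fileEncoding=, and set check.names = FALSE. The code below is only a sketch of that idea; it has not been verified on the Windows setups described in this thread, and the temporary file stands in for myFile.csv.

```r
# Write the thread's sample as raw UTF-8 bytes so the example is self-contained.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(enc2utf8(c('"\u0531","\u0532"', "1,10", "2,20")), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" only marks the strings that are read as UTF-8; unlike
# fileEncoding, it does not re-encode the input on the connection.
# check.names = FALSE keeps the header as read instead of passing it
# through make.names().
df <- read.csv(path, encoding = "UTF-8", check.names = FALSE)

# Compare code points rather than printed output, because printing may still
# show <U+0531> in a code page that cannot represent Armenian letters.
header_codepoints <- vapply(names(df), utf8ToInt, integer(1), USE.NAMES = FALSE)
print(header_codepoints)
```

If the code points come back as 1329 and 1330 (0x531 and 0x532), the strings were read intact and any remaining <U+0531> output is a display limitation of the locale rather than data loss.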