[R] Non-ASCII characters in R on Windows

Ista Zahn istazahn at gmail.com
Mon Sep 16 21:35:35 CEST 2013


Hi Duncan,

I've put an example file online at
https://docs.google.com/file/d/0B73Ve8vxnjR6QnRESXBQTHRUME0/edit?usp=sharing,
with a screenshot showing the expected contents of the file at
https://docs.google.com/file/d/0B73Ve8vxnjR6b1ZSQmtsRXdadVU/edit?usp=sharing

Hopefully you'll find this easy and the rest of us can feel dumb for
not having figured it out...

Thanks,
Ista

On Mon, Sep 16, 2013 at 1:39 PM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
> On 16/09/2013 12:04 PM, Maxim Linchits wrote:
>>
>> Here is that old post:
>>
>> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
>
>
> In that post, you'll see I asked for a sample file.  I never received any
> reply; presumably some spam filter didn't like what Alexander sent me, and
> Nabble doesn't archive any attachment.
>
> Similarly, the Stackoverflow thread contains no sample data.
>
> Could someone who is having this problem please put a small sample online
> for download?  As I told Alexander last time, my experiments with files I
> constructed myself showed no errors.
>
> Duncan Murdoch
>
>
>>
>> A taste: "Again, the issue is that opening this UTF-8 encoded file
>> under R 2.13.0 yields an error, but opening it under R 2.12.2 works
>> without any issues. (...)"
>>
>> On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat <nalimilan at club.fr>
>> wrote:
>> > Le lundi 16 septembre 2013 à 10:40 +0200, Milan Bouchet-Valat a écrit :
>> >> Le vendredi 13 septembre 2013 à 23:38 +0400, Maxim Linchits a écrit :
>> >> > This is a condensed version of the same question on stackexchange
>> >> > here:
>> >> >
>> >> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
>> >> > If you've already stumbled upon it feel free to ignore.
>> >> >
>> >> > My problem is that R on US Windows does not read *any* text file
>> >> > that contains *any* foreign characters. It simply reads the first n
>> >> > consecutive ASCII characters and then throws a warning once it
>> >> > reaches a foreign character:
>> >> >
>> >> > > test <- read.table("test.txt", sep=";", dec=",", quote="",
>> >> >                      fileEncoding="UTF-8")
>> >> > Warning messages:
>> >> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "",
>> >> >    fileEncoding = "UTF-8") :
>> >> >   invalid input found on input connection 'test.txt'
>> >> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "",
>> >> >    fileEncoding = "UTF-8") :
>> >> >   incomplete final line found by readTableHeader on 'test.txt'
>> >> > > print(test)
>> >> >        V1
>> >> > 1 english
>> >> >
>> >> > > Sys.getlocale()
>> >> > [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>> >> >
>> >> >
>> >> > It is important to note that R on Linux reads UTF-8 as well as
>> >> > exotic character sets without a problem. I've tried it with the
>> >> > exact same files (one was UTF-8 and another was OEM866 Cyrillic).
>> >> >
>> >> > If I do not include the fileEncoding parameter, read.table will
>> >> > read the whole CSV file. But naturally it will read it wrong because
>> >> > it does not know the encoding. So whenever I try to specify the
>> >> > fileEncoding, R will throw the warnings and stop once it reaches a
>> >> > foreign character. It's the same story with all international
>> >> > character encodings.
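A sketch of the workaround implied above (assuming the file really is UTF-8; the helper name is made up for illustration): read the file without fileEncoding so the raw bytes come through untranslated, then declare the encoding afterwards with Encoding()<-.

```r
## Workaround sketch: read without fileEncoding, then mark character
## columns as UTF-8. Assumes the file really is UTF-8; declaring the
## wrong encoding yields mojibake instead of an error.
read_utf8_table <- function(path, ...) {
  tab <- read.table(path, ..., stringsAsFactors = FALSE)
  tab[] <- lapply(tab, function(col) {
    if (is.character(col)) Encoding(col) <- "UTF-8"
    col
  })
  tab
}
```

Whether this sidesteps the Windows-specific failure reported here is untested; it merely avoids the fileEncoding= conversion path entirely.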
>> >> > Other users on stackexchange have reported exactly the same issue.
>> >> >
>> >> >
>> >> > Is anyone here who is on a US version of Windows able to import
>> >> > files with foreign characters? Please let me know.
>> >> A reproducible example would have helped, as requested by the posting
>> >> guide.
>> >>
>> >> That said, I am also experiencing the same problem after saving the
>> >> data below to a CSV file encoded in UTF-8 (you can do this even using
>> >> Notepad):
>> >> "Ա","Բ"
>> >> 1,10
>> >> 2,20
>> >>
>> >> This is on a Windows 7 box using French locale, but same codepage 1252
>> >> as yours. What is interesting is that reading the file using
>> >> readLines(file("myFile.csv", encoding="UTF-8"))
>> >> gives no invalid characters. So there must be a bug in read.table().
>> >>
>> >>
>> >> But I must note I do not experience issues with French accented
>> >> characters like "é" ("\Ue9"). On the other hand, reading Armenian
>> >> characters like "Ա" ("\U531") gives weird results: the character
>> >> appears as <U+0531> instead of Ա.
>> >>
>> >> Self-contained example, writing the file and reading it back from R:
>> >> tmpfile <- tempfile()
>> >> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
>> >> readLines(file(tmpfile, encoding="UTF-8"))
>> >> # "<U+0531>"
>> >>
>> >> The same phenomenon happens when creating a data frame from this
>> >> character (as noted on StackExchange):
>> >> data.frame("\U531")
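A small diagnostic sketch (not specific to Windows) for seeing what R is actually storing in such cases: the encoding mark, the underlying bytes, and the code point.

```r
## Diagnostic sketch: inspect what R keeps for a character string.
x <- "\u00e9"            # é: representable in both latin1 and UTF-8
Encoding(x)              # mark R attached: "latin1", "UTF-8" or "unknown"
charToRaw(enc2utf8(x))   # the UTF-8 bytes: c3 a9
utf8ToInt("\u0531")      # code point of Armenian Ա: 1329 (0x531)
```

If the bytes and code point are right but printing shows <U+0531>, the data is intact and only the console rendering is failing.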
>> >>
>> >> So my conclusion is that maybe Windows does not really support
>> >> Unicode characters that are not "relevant" to your current locale,
>> >> and that may have created bugs in the way R handles them in
>> >> read.table(). R developers can probably tell us more about it.
>> > After some more investigation, one part of the problem can be traced
>> > back to scan() (with myFile.csv filled as described above):
>> > scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
>> > # Read 2 items
>> > # [1] "Ա" "Բ"
>> >
>> > Equivalent, but nonsensical to me:
>> > scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",",
>> > nlines=1)
>> > # Read 2 items
>> > # [1] "Ա" "Բ"
>> >
>> > scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
>> > # Read 0 items
>> > # character(0)
>> > # Warning message:
>> > # In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>> > #  invalid input found on input connection 'myFile.csv'
>> >
>> >
>> > So there seems to be one part of the issue in scan(), which for some
>> > reason does not work when passed fileEncoding="UTF-8"; and another
>> > part in read.table(), which transforms "Ա" ("\U531") into "X.U.0531.",
>> > probably via make.names(), since:
>> > make.names("\U531")
>> > # "X.U.0531."
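If the header mangling indeed goes through make.names(), one way to sidestep that particular part (a sketch; it does nothing for the scan() encoding problem) is read.table's check.names = FALSE:

```r
## Sketch: read.table() runs make.names() on the header by default;
## check.names = FALSE keeps non-syntactic column names untouched.
tmp <- tempfile()
writeLines(c('"a b","c-d"', '1,2'), tmp)
tab <- read.table(tmp, sep = ",", header = TRUE, check.names = FALSE)
names(tab)          # "a b" "c-d", whereas make.names("a b") gives "a.b"
```

This example uses ASCII names only to isolate the make.names() step from the encoding question.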
>> >
>> >
>> > Does this make sense to R-core members?
>> >
>> >
>> > Regards
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>


