[Rd] Writing UTF8 on Windows

Milan Bouchet-Valat nalimilan at club.fr
Sun Oct 19 21:32:04 CEST 2014


On Saturday 18 October 2014 at 20:49 -0700, Jeroen Ooms wrote:
> Recent functionality in jsonlite allows for streaming json to a user
> supplied connection object, such as a file, pipe or socket. RFC7159
> prescribes json must be encoded as unicode; ISO-8859 (including
> latin1) is invalid. Hence I would like R to write strings as utf8,
> irrespective of the type of connection, platform or locale.
> Implementing this turns out to be unsurprisingly difficult on windows.
> 
> > string <- enc2utf8("Zürich")
> > Encoding(string)
> [1] "UTF-8"
> 
> For example when writing the utf8 string to a binary utf8
> connection, the output seems to be latin1:
> 
> > con <- file("test1.txt", open="wb", encoding = "UTF-8")
> > writeLines(string, con)
> > close(con)
> > system("file test1.txt")
> test1.txt: ISO-8859 text
> > readLines("test1.txt", encoding="UTF-8")
> [1] "Z\xfcrich"
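The encoding argument doesn't do what you (quite logically, I should say)
expect: it only marks the strings that are read as being in that encoding,
it does not re-encode the input. To re-encode, you should create a
connection with the encoding set, just like you did above for writing, and
call readLines() on that.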
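
To illustrate the difference, here is a minimal sketch (the file name and variable names are mine, and the bytes are written by hand so that the file content does not depend on the locale). It compares marking with readLines(encoding = ) against re-encoding through a connection:

```r
# Write the latin1 bytes of "Zürich" by hand (0xfc is "ü" in latin1).
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x5a, 0xfc, 0x72, 0x69, 0x63, 0x68, 0x0a)), path)

# readLines(encoding = ) only declares what the bytes are supposed to be;
# the bytes themselves are left untouched.
marked <- readLines(path, encoding = "UTF-8")
print(Encoding(marked))
print(charToRaw(marked))   # still contains the single byte fc

# A connection opened with the file's actual encoding re-encodes the input
# to the native encoding of the session while reading.
con <- file(path, open = "r", encoding = "latin1")
converted <- readLines(con)
close(con)
print(enc2utf8(converted))
```

The last step depends on the session's locale being able to represent "ü", so I would not rely on its exact output everywhere.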

That may not fix your problem, though, since 'file' says the file is
ISO-8859-1. But 'file' may sometimes be mistaken, since many bytes are
common to both encodings, so it is better to check the bytes yourself.
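
One way to check without relying on 'file' is to look at the raw bytes. This is only a sketch: the temporary file stands in for your test files, and it assumes the connection uses the default (native) encoding so that useBytes = TRUE passes the bytes through unchanged:

```r
# Write a UTF-8 string byte-for-byte to a binary connection.
path <- tempfile(fileext = ".txt")
string <- enc2utf8("Z\u00fcrich")
con <- file(path, open = "wb")
writeLines(string, con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes and inspect them.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
print(bytes)

# In UTF-8, "ü" is the two bytes c3 bc; in latin1 it is the single byte fc.
looks_utf8 <- identical(bytes[2:3], as.raw(c(0xc3, 0xbc)))
print(looks_utf8)
```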

> I am not quite sure if this is a bug or expected. To avoid this and
> other problems, jsonlite uses the 'useBytes' argument, which is
> supposed to suppress re-encoding when writing to the connection. This
> is exactly what we need: use enc2utf8 to convert our string to utf8
> and then pass it byte-by-byte to the connection:
> 
> > con <- file("test2.txt", open="wb", encoding = "UTF-8")
> > writeLines(string, con, useBytes = TRUE)
> > close(con)
> > system("file test2.txt")
> test2.txt: UTF-8 Unicode text
> > readLines("test2.txt", encoding="UTF-8")
> [1] "Zürich"
> 
> However useBytes results in incorrect output for non-binary
> connections. Not sure what is the intention here but it looks as if
> the string gets re-encoded one time too often:
> 
> > con <- file("test3.txt", open="w", encoding = "UTF-8")
> > writeLines(string, con, useBytes = TRUE)
> > close(con)
> > system("file test3.txt")
> test3.txt: UTF-8 Unicode text, with CRLF line terminators
> > readLines("test3.txt", encoding="UTF-8")
> [1] "Zürich
Same here.

> Strangely we do get utf8 output if we set the encoding of the
> connection to latin1. This suggests that there *is* some re-encoding
> going on, in contrast to what the useBytes manual states.
> 
> > con <- file("test4.txt", open="w", encoding = "latin1")
> > writeLines(string, con, useBytes = TRUE)
> > close(con)
> > system("file test4.txt")
> test4.txt: UTF-8 Unicode text, with CRLF line terminators
> > readLines("test4.txt", encoding="UTF-8")
> [1] "Zürich"
> 
> However useBytes is definitely not ignored either, because disabling
> it will (now correctly) write latin1 again:
> 
> > con <- file("test5.txt", open="w", encoding = "latin1")
> > writeLines(string, con, useBytes = FALSE)
> > close(con)
> > system("file test5.txt")
> test5.txt: ISO-8859 text, with CRLF line terminators
> > readLines("test5.txt", encoding="UTF-8")
> [1] "Z\xfcrich"
Same here: you're reading the ISO-8859-1 data and then, without
re-encoding it, treating it as UTF-8. This cannot be correct.

> I am going to stop here. My primary question is: what is the best
> method to write a utf8 string as utf8 to an arbitrary connection
> object, without any re-encoding, that works on any platform and
> locale.
Have you tried using writeBin()? Its documentation seems to imply it
could do what you want. But maybe writeLines() is enough, as the
problems you had above are more about reading than about writing.
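
For what it's worth, here is an untested-on-Windows sketch of the writeBin() idea. The helper name write_utf8_lines is made up for illustration, and it assumes the connection is open in binary mode:

```r
# Hypothetical helper: write each string as UTF-8 bytes followed by "\n",
# bypassing any re-encoding done by the connection.
write_utf8_lines <- function(x, con) {
  for (s in enc2utf8(x)) {
    writeBin(c(charToRaw(s), as.raw(0x0a)), con)
  }
  invisible(NULL)
}

path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
write_utf8_lines("Z\u00fcrich", con)
close(con)

# Inspect what actually ended up in the file.
written <- readBin(path, what = "raw", n = file.info(path)$size)
print(written)
```

Since the bytes are produced by charToRaw() before they reach the connection, the connection's encoding setting should not come into play.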

Others should be able to give you more informed answers about writing.


Regards


