[Rd] Writing UTF8 on Windows
Jeroen Ooms
jeroen.ooms at stat.ucla.edu
Sun Oct 19 05:49:43 CEST 2014
Recent functionality in jsonlite allows for streaming JSON to a
user-supplied connection object, such as a file, pipe or socket.
RFC 7159 prescribes that JSON must be encoded as Unicode (UTF-8,
UTF-16 or UTF-32); ISO-8859 (including latin1) is invalid. Hence I
would like R to write strings as UTF-8, irrespective of the type of
connection, platform or locale.
Implementing this turns out to be unsurprisingly difficult on Windows.
> string <- enc2utf8("Zürich")
> Encoding(string)
[1] "UTF-8"
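To confirm that the string really holds UTF-8 bytes and is not just marked as such, the raw bytes can be inspected. A minimal sketch, assuming the string is built from a \u escape so that the result does not depend on the encoding of the script or the console:

```r
# Build the string from a Unicode escape instead of a literal u-umlaut.
string <- enc2utf8("Z\u00fcrich")

# The declared encoding of the string.
Encoding(string)

# The underlying bytes: in UTF-8 the u-umlaut is the two bytes c3 bc,
# whereas in latin1 it would be the single byte fc.
charToRaw(string)
```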
For example, when writing the UTF-8 string to a binary connection
opened with encoding UTF-8, the output seems to be latin1:
> con <- file("test1.txt", open="wb", encoding = "UTF-8")
> writeLines(string, con)
> close(con)
> system("file test1.txt")
test1.txt: ISO-8859 text
> readLines("test1.txt", encoding="UTF-8")
[1] "Z\xfcrich"
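The 'file' utility is not part of a standard Windows installation, so the same check can be done from within R by reading the file back as raw bytes. A minimal sketch; the helper name file_bytes is hypothetical and only used for illustration:

```r
# Hypothetical helper: return the complete content of a file as raw bytes.
file_bytes <- function(path) {
  readBin(path, what = "raw", n = file.info(path)$size)
}

# Demonstration on a temporary file with known content:
# "Z", u-umlaut as UTF-8 (c3 bc), "rich", and a newline.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x5a, 0xc3, 0xbc, 0x72, 0x69, 0x63, 0x68, 0x0a)), tmp)
bytes <- file_bytes(tmp)
print(bytes)
unlink(tmp)
```

When applied to the test files, a u-umlaut stored as fc indicates latin1, c3 bc indicates UTF-8, and c3 83 c2 bc indicates that UTF-8 bytes were converted to UTF-8 a second time.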
I am not quite sure whether this is a bug or expected behaviour. To
avoid this and other problems, jsonlite uses the 'useBytes' argument,
which is supposed to suppress re-encoding when writing to the
connection. This is exactly what we need: use enc2utf8 to convert our
string to UTF-8 and then pass it byte-by-byte to the connection:
> con <- file("test2.txt", open="wb", encoding = "UTF-8")
> writeLines(string, con, useBytes = TRUE)
> close(con)
> system("file test2.txt")
test2.txt: UTF-8 Unicode text
> readLines("test2.txt", encoding="UTF-8")
[1] "Zürich"
However, useBytes results in incorrect output for non-binary
connections. I am not sure what the intention is here, but it looks
as if the string gets re-encoded one time too often:
> con <- file("test3.txt", open="w", encoding = "UTF-8")
> writeLines(string, con, useBytes = TRUE)
> close(con)
> system("file test3.txt")
test3.txt: UTF-8 Unicode text, with CRLF line terminators
> readLines("test3.txt", encoding="UTF-8")
[1] "ZÃ¼rich"
Strangely, we do get UTF-8 output if we set the encoding of the
connection to latin1. This suggests that there *is* some re-encoding
going on, in contrast to what the useBytes manual states.
> con <- file("test4.txt", open="w", encoding = "latin1")
> writeLines(string, con, useBytes = TRUE)
> close(con)
> system("file test4.txt")
test4.txt: UTF-8 Unicode text, with CRLF line terminators
> readLines("test4.txt", encoding="UTF-8")
[1] "Zürich"
However, useBytes is definitely not ignored either, because disabling
it will (now correctly) write latin1 again:
> con <- file("test5.txt", open="w", encoding = "latin1")
> writeLines(string, con, useBytes = FALSE)
> close(con)
> system("file test5.txt")
test5.txt: ISO-8859 text, with CRLF line terminators
> readLines("test5.txt", encoding="UTF-8")
[1] "Z\xfcrich"
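To make it easier to compare these results across platforms and locales, the five experiments can be collected into one script that prints the bytes of each file instead of relying on 'file'. This is only a sketch of the same steps: it writes to temporary files rather than test1.txt to test5.txt, and the names cases and results are introduced here for illustration. The output depends on the platform and locale, so none is shown.

```r
string <- enc2utf8("Z\u00fcrich")

# One row per experiment above: open mode, connection encoding, useBytes.
cases <- data.frame(
  open     = c("wb", "wb", "w", "w", "w"),
  encoding = c("UTF-8", "UTF-8", "UTF-8", "latin1", "latin1"),
  useBytes = c(FALSE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

results <- vector("list", nrow(cases))
for (i in seq_len(nrow(cases))) {
  path <- tempfile(fileext = ".txt")
  con <- file(path, open = cases$open[i], encoding = cases$encoding[i])
  writeLines(string, con, useBytes = cases$useBytes[i])
  close(con)

  # Read the file back as raw bytes so no decoding is involved.
  results[[i]] <- readBin(path, what = "raw", n = file.info(path)$size)
  unlink(path)

  cat(sprintf("test%d: %s\n", i,
              paste(as.character(results[[i]]), collapse = " ")))
}
```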
I am going to stop here. My primary question is: what is the best
method to write a UTF-8 string as UTF-8 to an arbitrary connection
object, without any re-encoding, that works on any platform and
locale?