[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
Jack Kelley
Jack.Kelley at bigpond.com
Tue May 2 02:49:28 CEST 2017
Thanks for looking into this.
A few notes regarding all the UTF encodings on Windows 10 ...
The default eol for write.csv (via write.table) is "\n", which on Windows
is always written as as.raw (c (0x0d, 0x0a)), that is, <Carriage Return>
<Line Feed> as adjacent bytes. This is fine for UTF-8 but wrong for UTF-16
and UTF-32, where every character (CR and LF included) occupies 2 or 4
bytes.
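For comparison, here is a minimal sketch of what a correctly encoded CR+LF should look like in each encoding. It uses iconv() with toRaw = TRUE; the names crlf, targets and expected_eol are only illustrative, and it assumes the platform's iconv supports these encoding names.

```r
# Encode the two-character string CR+LF in each target encoding.
# With toRaw = TRUE, iconv() returns a list of raw vectors, so embedded
# nul bytes survive.
crlf <- "\r\n"
targets <- c("UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE")
expected_eol <- lapply(targets, function(enc) {
  iconv(crlf, from = "UTF-8", to = enc, toRaw = TRUE)[[1]]
})
names(expected_eol) <- targets
print(expected_eol)
```

In UTF-32LE this should give 8 bytes (0d 00 00 00 0a 00 00 00), not the 5 bytes (0d 0a 00 00 00) seen in the files.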
EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are
missing in the final CR+LF):
df <- data.frame (x = 1:2, y = 3:4)
$`UTF-32LE`$default.eol$raw
 [1] 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00 00 00 22
[26] 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a 00 00 00
[51] 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a 00 00 00
$`UTF-32BE`$default.eol$raw
 [1] 00 00 00 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00
[26] 00 00 22 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a
[51] 00 00 00 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a
(Nevertheless, Microsoft Excel 2013 tolerates these CSVs!)
One trick/solution is to use eol = "\r" (that is, <Carriage Return> only).
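Another possibility, following the toRaw suggestion quoted further down, is to bypass text-mode connections entirely: build the CSV text, convert it to a raw vector with iconv(), and write the bytes with writeBin(). The sketch below is only an illustration under those assumptions. write_csv_raw is a hypothetical helper, not part of R, and it assumes the captured CSV text converts cleanly to UTF-8.

```r
df <- data.frame(x = 1:2, y = 3:4)

# Hypothetical helper (not part of R): write a data frame as CSV in the
# given encoding without going through a text-mode connection.
write_csv_raw <- function(df, path, encoding, eol = "\r\n") {
  # Capture the CSV lines that write.csv() would print to the console.
  lines <- capture.output(write.csv(df, row.names = FALSE))
  # Append the line terminator to every line, then encode the whole text,
  # so the terminator is converted together with the data.
  txt <- enc2utf8(paste0(lines, eol, collapse = ""))
  bytes <- iconv(txt, from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
  if (is.null(bytes)) stop("conversion to ", encoding, " failed")
  # writeBin() writes the raw vector as is: no eol translation takes place.
  writeBin(bytes, path)
  invisible(bytes)
}

path <- tempfile(fileext = ".csv")
bytes <- write_csv_raw(df, path, "UTF-16LE")
on_disk <- readBin(path, "raw", 1000)
print(on_disk)
```

No byte order mark is written here; if one is wanted, it would have to be prepended to the raw vector before the writeBin() call.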
Regards -- Jack Kelley
------------------------------------------------------------------------------------
# Start from a clean workspace and record the session details.
remove (list = objects())
print (sessionInfo())
cat ("##########################################################\n\n")

ENCODING <- c (
  "UTF-8",
  "UTF-16LE", "UTF-16BE", "UTF-16",
  "UTF-32LE", "UTF-32BE", "UTF-32"
)
df <- data.frame (x = 1:2, y = 3:4)

# Write df once per encoding with the default eol, then read back the raw bytes.
csv <- structure (lapply (ENCODING, function (encoding) {
  csv <- sprintf ("df_%s.csv", encoding)
  write.csv (df, csv, fileEncoding = encoding, row.names = FALSE)
  list (default.eol = list (
    csv = csv, raw = readBin (csv, "raw", 1000))
  )
}), .Names = ENCODING)

# Repeat with each explicit eol (LF, CR, CR+LF) for every encoding.
EOL <- c (LF = "\n", CR = "\r", "CR+LF" = "\r\n")
CSV <- structure (lapply (ENCODING, function (encoding) {
  structure (
    lapply (names (EOL), function (EOL.name) {
      csv <- sprintf ("df_%s_eol=%s.csv", encoding, EOL.name)
      write.csv (
        df, csv, fileEncoding = encoding, row.names = FALSE,
        eol = EOL [EOL.name]
      )
      list (csv = csv, raw = readBin (csv, "raw", 1000))
    }), .Names = names (EOL))
}), .Names = ENCODING)

print (csv)
print (CSV)
------------------------------------------------------------------------------------
-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.duncan at gmail.com]
Sent: Tuesday, 2 May 2017 04:22
To: Jack Kelley <Jack.Kelley at bigpond.com>; r-devel at r-project.org
Subject: Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
> No, I don't think anyone is working on this.
>
> There's a fairly simple workaround for the UTF-16 and UTF-32 iconv
> issues: don't attempt to produce character vectors, produce raw vectors
> instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors
> can contain embedded nulls. Character vectors can't, because
> internally, R is using 8 bit C strings, and the nulls are string
> terminators.
>
> I don't know how difficult it would be to fix the write.table problems.
I've now taken a look, and it appears as if it's not too hard. I'll see
if I can work out a patch that I trust.
Duncan Murdoch
>
> Duncan Murdoch
>
> On 29/04/2017 7:53 PM, Jack Kelley wrote:
>> "R version 3.4.0 (2017-04-21)" on "x86_64-w64-mingw32" platform
>> ... [rest omitted]