[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?

Jack Kelley Jack.Kelley at bigpond.com
Tue May 2 02:49:28 CEST 2017


Thanks for looking into this.

A few notes regarding all the UTF encodings on Windows 10 ...

The default eol for write.csv (via write.table) is "\n", and on Windows it is
always written as as.raw (c (0x0d, 0x0a)), that is, <Carriage Return> <Line Feed>
as two adjacent single bytes. This is fine for UTF-8 but wrong for UTF-16 and
UTF-32, where each of CR and LF needs its own 2-byte or 4-byte code unit.
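
For comparison, here is a minimal sketch of what a correctly encoded CR+LF
looks like, using iconv() with toRaw = TRUE. The variable names are only for
illustration and are not part of the test script further down.

```r
# Encode "\r\n" straight to raw bytes, so that CR and LF each get
# their own 2-byte (UTF-16) or 4-byte (UTF-32) code unit.
crlf <- "\r\n"

utf16le <- iconv(crlf, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
utf32le <- iconv(crlf, from = "UTF-8", to = "UTF-32LE", toRaw = TRUE)[[1]]

print(utf16le)  # [1] 0d 00 0a 00
print(utf32le)  # [1] 0d 00 00 00 0a 00 00 00
```

Neither result contains 0d and 0a as adjacent bytes, unlike the files shown
in the example.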

EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are
missing in the final CR+LF):

df <- data.frame (x = 1:2, y = 3:4)

$`UTF-32LE`$default.eol$raw
 [1] 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00 00 00 22
[26] 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a 00 00 00
[51] 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a 00 00 00

$`UTF-32BE`$default.eol$raw
 [1] 00 00 00 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00
[26] 00 00 22 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a
[51] 00 00 00 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a

(Nevertheless, Microsoft Excel 2013 tolerates these CSVs!)

One trick/solution is to use eol = "\r" (that is, <Carriage Return> only).
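
A minimal sketch of this workaround for a single encoding follows. The
temporary file name is chosen only for illustration, and the comments describe
what I would expect to see rather than something checked on every platform.

```r
# Write a small data frame as UTF-16LE with a bare CR as the line ending.
df <- data.frame(x = 1:2, y = 3:4)
csv <- tempfile(fileext = ".csv")  # illustrative temporary file

write.csv(df, csv, fileEncoding = "UTF-16LE", row.names = FALSE, eol = "\r")

# Read the file back as raw bytes for inspection.
bytes <- readBin(csv, "raw", 1000)
print(bytes)

# With no LF in the output there is nothing to expand into a single-byte
# CR+LF pair, so no 0a byte should appear.
print(any(bytes == as.raw(0x0a)))
```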

Regards -- Jack Kelley

----------------------------------------------------------------------------

remove (list = objects())
print (sessionInfo())
cat ("##########################################################\n\n")

ENCODING <- c (
  "UTF-8",
  "UTF-16LE", "UTF-16BE", "UTF-16",
  "UTF-32LE", "UTF-32BE", "UTF-32"
)

df <- data.frame (x = 1:2, y = 3:4)

# Write df once per encoding with the default eol, then read the file back
# as raw bytes so the line endings can be inspected.
csv <- structure (lapply (ENCODING, function (encoding) {
  csv <- sprintf ("df_%s.csv", encoding)
  write.csv (df, csv, fileEncoding = encoding, row.names = FALSE)
  list (default.eol = list (
    csv = csv, raw = readBin (csv, "raw", 1000))
  )
}), .Names = ENCODING)

# Explicit line endings to compare against the default.
EOL <- c (LF = "\n", CR = "\r", "CR+LF" = "\r\n")

# Repeat the write/read-back for every combination of encoding and eol.
CSV <- structure (lapply (ENCODING, function (encoding) {
  structure (
    lapply (names (EOL), function (EOL.name) {
      csv <- sprintf ("df_%s_eol=%s.csv", encoding, EOL.name)
      write.csv (
        df, csv, fileEncoding = encoding, row.names = FALSE,
        eol = EOL [EOL.name]
      )
      list (csv = csv, raw = readBin (csv, "raw", 1000))
    }), .Names = names (EOL))
}), .Names = ENCODING)

print (csv)
print (CSV)

----------------------------------------------------------------------------

-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.duncan at gmail.com] 
Sent: Tuesday, 2 May 2017 04:22
To: Jack Kelley <Jack.Kelley at bigpond.com>; r-devel at r-project.org
Subject: Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?

On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
> No, I don't think anyone is working on this.
>
> There's a fairly simple workaround for the UTF-16 and UTF-32 iconv
> issues:  don't attempt to produce character vectors, produce raw vectors
> instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors
> can contain embedded nulls.  Character vectors can't, because
> internally, R is using 8 bit C strings, and the nulls are string
> terminators.
>
> I don't know how difficult it would be to fix the write.table problems.
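
The raw-vector workaround quoted above might be sketched as follows. This is
only an illustration under my own assumptions: the variable names, the use of
capture.output() to build the CSV text in memory, and the choice of UTF-16LE
with CR+LF line endings are not from the original message.

```r
# Build the CSV text in memory instead of letting write.csv encode the file.
df <- data.frame(x = 1:2, y = 3:4)
lines <- capture.output(write.csv(df, row.names = FALSE))
txt <- paste0(paste(lines, collapse = "\r\n"), "\r\n")

# Convert to a raw vector; unlike a character vector, it can hold the
# embedded nul bytes that UTF-16 and UTF-32 need.
raw_bytes <- iconv(enc2utf8(txt), from = "UTF-8", to = "UTF-16LE",
                   toRaw = TRUE)[[1]]

# writeBin() writes the bytes unchanged (no line-ending translation).
out <- tempfile(fileext = ".csv")  # illustrative temporary file
writeBin(raw_bytes, out)

print(readBin(out, "raw", 1000))
```

No byte order mark is written in this sketch; whether a given reader needs one
is a separate question.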

I've now taken a look, and it appears as if it's not too hard.  I'll see 
if I can work out a patch that I trust.

Duncan Murdoch

>
> Duncan Murdoch
>
> On 29/04/2017 7:53 PM, Jack Kelley wrote:
>> "R version 3.4.0 (2017-04-21)"  on "x86_64-w64-mingw32" platform
>> ... [rest omitted]


