[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?

Wed May 3 19:21:58 CEST 2017

Now fixed in R-devel revision 72650.

Duncan Murdoch

On 02/05/2017 4:11 AM, Duncan Murdoch wrote:
> On 01/05/2017 8:49 PM, Jack Kelley wrote:
>> Thanks for looking into this.
>>
>> A few notes regarding all the UTF encodings on Windows 10 ...
>
> This all stems from the ancient bad decision by Microsoft to translate
> LF characters to CR LF when writing text files.  R passes 0A or 0A 00 or
> 0A 00 00 00 to the output routine (part of the C run-time), and that
> routine needs to know how many bytes make up each character in order to
> add a CR of the right width.
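
To illustrate the byte widths involved, here is a small R sketch. It is not taken from the patch; it only uses iconv() with toRaw = TRUE, and the variable names are made up for illustration. It shows how a single LF is encoded in each width, and what a correctly widened CR LF looks like in UTF-16LE:

```r
# Encode a lone line feed in three encodings of different code-unit width.
encodings <- c("UTF-8", "UTF-16LE", "UTF-32LE")
lf_bytes <- lapply(encodings, function(enc) {
  iconv("\n", from = "UTF-8", to = enc, toRaw = TRUE)[[1]]
})
names(lf_bytes) <- encodings
print(lf_bytes)

# A correctly widened CR LF in UTF-16LE: each character takes two bytes.
crlf_utf16le <- iconv("\r\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(crlf_utf16le)
```

Putting a single 8-bit 0d in front of the wider LF forms gives the 0d 0a 00 ... pattern visible in the raw dumps quoted further down.
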
>
> The default is 8 bit, so you get 0D 0A in current versions of R,
> regardless of the encoding.
>
> There are ways to declare UTF-16LE (see
> https://msdn.microsoft.com/en-us/library/yeby3zcb.aspx, or Google
> "Windows fopen" if that moves), but no other wide encoding.  That's what
> I'm putting in place if you ask for UTF-16LE or UCS-2LE.  So far I'm not
> planning to handle UTF-16BE or UTF-32, because doing those would mean R
> would have to handle the translation of LF itself, and I'm too lazy to
> do that.
>
> So far this is working for writes, but not reads.  I still have to track
> down what's going wrong there.
>
> Duncan Murdoch
>
>>
>> The default eol for write.csv (via write.table) is "\n" and always gives
>> as.raw (c (0x0d, 0x0a)), that is, <Carriage Return> <Line Feed> as adjacent
>> bytes. This is fine for UTF-8 but wrong for UTF-16 and UTF-32.
>>
>> EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are
>> missing in the final CR+LF):
>>
>> df <- data.frame (x = 1:2, y = 3:4)
>>
>> $`UTF-32LE`$default.eol$raw
>>  [1] 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00 00 00 22
>> [26] 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a 00 00 00
>> [51] 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a 00 00 00
>>
>> $`UTF-32BE`$default.eol$raw
>>  [1] 00 00 00 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00
>> [26] 00 00 22 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a
>> [51] 00 00 00 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a
>>
>> (Nevertheless, Microsoft Excel 2013 tolerates these CSVs!)
>>
>> One trick/solution is to use eol = "\r" (that is, <Carriage Return> only).
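
A minimal sketch of the eol = "\r" trick, assuming UTF-16LE as the target and a temporary file as a hypothetical output location. Because no LF is ever written, the C run-time has nothing to translate. The sketch makes no assumption about whether a byte-order mark is written:

```r
df <- data.frame(x = 1:2, y = 3:4)
path <- tempfile(fileext = ".csv")  # hypothetical output location

# With eol = "\r" each line ends in CR only, so no LF reaches the output routine.
write.csv(df, path, fileEncoding = "UTF-16LE", row.names = FALSE, eol = "\r")

# Inspect the bytes that actually landed on disk.
bytes <- readBin(path, "raw", 1000)
print(bytes)
```

Note that the result uses CR-only line endings, so whatever reads the file has to accept them.
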
>>
>> Regards -- Jack Kelley
>>
>> ----------------------------------------------------------------------------
>>
>> remove (list = objects())
>> print (sessionInfo())
>> cat ("##########################################################\n\n")
>>
>> ENCODING <- c (
>>   "UTF-8",
>>   "UTF-16LE", "UTF-16BE", "UTF-16",
>>   "UTF-32LE", "UTF-32BE", "UTF-32"
>> )
>>
>> df <- data.frame (x = 1:2, y = 3:4)
>>
>> csv <- structure (lapply (ENCODING, function (encoding) {
>>   csv <- sprintf ("df_%s.csv", encoding)
>>   write.csv (df, csv, fileEncoding = encoding, row.names = FALSE)
>>   list (default.eol = list (
>>     csv = csv, raw = readBin (csv, "raw", 1000))
>>   )
>> }), .Names = ENCODING)
>>
>> EOL <- c (LF = "\n", CR = "\r", "CR+LF" = "\r\n")
>>
>> CSV <- structure (lapply (ENCODING, function (encoding) {
>>   structure (
>>     lapply (names (EOL), function (EOL.name) {
>>       csv <- sprintf ("df_%s_eol=%s.csv", encoding, EOL.name)
>>       write.csv (
>>         df, csv, fileEncoding = encoding, row.names = FALSE,
>>         eol = EOL [EOL.name]
>>       )
>>       list (csv = csv, raw = readBin (csv, "raw", 1000))
>>   }), .Names = names (EOL))
>> }), .Names = ENCODING)
>>
>> print (csv)
>> print (CSV)
>>
>> ----------------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:murdoch.duncan at gmail.com]
>> Sent: Tuesday, 2 May 2017 04:22
>> To: Jack Kelley <Jack.Kelley at bigpond.com>; r-devel at r-project.org
>> Subject: Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
>>
>> On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
>>> No, I don't think anyone is working on this.
>>>
>>> There's a fairly simple workaround for the UTF-16 and UTF-32 iconv
>>> issues:  don't attempt to produce character vectors, produce raw vectors
>>> instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors
>>> can contain embedded nulls.  Character vectors can't, because
>>> internally, R is using 8 bit C strings, and the nulls are string
>>> terminators.
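
The raw-vector workaround can be sketched as follows. This is only an illustration under stated assumptions: the CSV text is rendered in memory with capture.output(), the line endings are written explicitly as CR LF, and the target encoding (UTF-16LE) and the temporary file are arbitrary choices, not something prescribed in this thread:

```r
df <- data.frame(x = 1:2, y = 3:4)

# Render the CSV as ordinary character lines first (no file involved yet).
csv_lines <- capture.output(write.csv(df, row.names = FALSE))

# Join the lines with explicit CR LF so nothing depends on text-mode translation.
txt <- enc2utf8(paste0(paste(csv_lines, collapse = "\r\n"), "\r\n"))

# Convert to a raw vector; raw vectors may contain the embedded nulls
# that a character vector cannot hold.
bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

# Write the bytes unchanged (binary write, no line-ending translation).
path <- tempfile(fileext = ".csv")  # hypothetical output location
writeBin(bytes, path)
```

Since writeBin() writes the bytes as given, every CR and LF in the file has the full two-byte width.
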
>>>
>>> I don't know how difficult it would be to fix the write.table problems.
>>
>> I've now taken a look, and it appears as if it's not too hard.  I'll see
>> if I can work out a patch that I trust.
>>
>> Duncan Murdoch
>>
>>>
>>> Duncan Murdoch
>>>
>>> On 29/04/2017 7:53 PM, Jack Kelley wrote:
>>>> "R version 3.4.0 (2017-04-21)"  on "x86_64-w64-mingw32" platform
>>>> ... [rest omitted]
>>
>>
>