[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Wed Feb 24 21:49:58 CET 2016

On 24/02/2016 11:16 AM, Duncan Murdoch wrote:
> On 24/02/2016 9:55 AM, Mikko Korpela wrote:
>> On 24.02.2016 15:47, Duncan Murdoch wrote:
>>> On 23/02/2016 7:06 AM, Mikko Korpela wrote:
>>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>>>        on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>>
>>>>>        > Dear R developers
>>>>>        > I think I have found a bug that can be reproduced with two lines of code
>>>>>        > and I am very thankful to get your first assessment or feed-back on my
>>>>>        > report.
>>>>>
>>>>>        > If this is the wrong mailing list or I did something wrong
>>>>>        > (e.g. semi "anonymous" email address to protect my privacy and defend
>>>>>        > against unwanted spam) please let me know since I am new here.
>>>>>
>>>>>        > Thank you very much :-)
>>>>>
>>>>>        > J. Altfeld
>>>>>
>>>>> Dear J.,
>>>>> (yes, a bit less anonymity would be very welcome here!),
>>>>>
>>>>> You are right, this is a bug, at least in the documentation, but
>>>>> probably "all real", indeed,
>>>>>
>>>>> but read on.
>>>>>
>>>>>        > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>>>        >>
>>>>>        >>
>>>>>        >> If I execute the code from the "?write.table" examples section
>>>>>        >>
>>>>>        >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>>>        >> # (omitted code)
>>>>>        >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>>>        >>
>>>>>        >> the resulting CSV file has a size of 6 bytes which is too short
>>>>>        >> (truncated):
>>>>>        >>
>>>>>        >> """,3
>>>>>
>>>>> reproducibly, yes.
>>>>> If you look at what write.csv does
>>>>> and then simplify, you can get a similar wrong result by
>>>>>
>>>>>      write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>>
>>>>> which results in a file with one line
>>>>>
>>>>> """ 3
>>>>>
>>>>> and if you debug  write.table() you see that its building blocks
>>>>> here are
>>>>>        file <- file(........, encoding = fileEncoding)
>>>>>
>>>>> a      writeLines(*, file=file)  for the column headers,
>>>>>
>>>>> and then "deeper down" C code which I did not investigate.
>>>>
>>>> I took a look at connections.c. There is a call to strlen() that gets
>>>> confused by null characters. I think the obvious fix is to avoid the
>>>> call to strlen() as the size is already known:
>>>>
>>>> Index: src/main/connections.c
>>>> ===================================================================
>>>> --- src/main/connections.c    (revision 70213)
>>>> +++ src/main/connections.c    (working copy)
>>>> @@ -369,7 +369,7 @@
>>>>             /* is this safe? */
>>>>             warning(_("invalid char string in output conversion"));
>>>>             *ob = '\0';
>>>> -        con->write(outbuf, 1, strlen(outbuf), con);
>>>> +        con->write(outbuf, 1, ob - outbuf, con);
>>>>         } while(again && inb > 0);  /* it seems some iconv signal -1 on
>>>>                            zero-length input */
>>>>         } else
>>>>
>>>>
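
To see why the strlen() call in the patch above undercounts, here is a small standalone sketch in C. It is not code from connections.c: the helper names (ascii_to_utf16le, length_by_strlen, length_by_pointer) are made up for illustration, and the encoder handles ASCII input only. It assumes the output buffer is zero-filled past the converted text, so that strlen() is safe to call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from connections.c): encode ASCII text as
 * UTF-16LE into out and return a pointer one past the last byte written,
 * similar to how iconv() advances its output pointer. */
char *ascii_to_utf16le(const char *text, char *out, size_t out_size)
{
    char *ob = out;
    for (; *text != '\0'; text++) {
        assert((size_t)(ob - out) + 2 <= out_size);
        *ob++ = *text;  /* low byte: the ASCII code */
        *ob++ = '\0';   /* high byte: zero, i.e. an embedded null */
    }
    return ob;
}

/* Byte count as the unpatched line computes it: stops at the first zero byte. */
size_t length_by_strlen(const char *outbuf)
{
    return strlen(outbuf);
}

/* Byte count as the patched line computes it: how far the output pointer moved. */
size_t length_by_pointer(const char *outbuf, const char *ob)
{
    return (size_t)(ob - outbuf);
}
```

For the input "CBA", length_by_strlen() reports 1 byte while length_by_pointer() reports 6, so a write sized by strlen() would emit only the first byte of each converted chunk.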
>>>>>
>>>>> But just looking a bit at such a file() object with writeLines()
>>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>>> "work" for this encoding:
>>>>>
>>>>>        > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>>>        > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>>>        > close(ff)
>>>>>        > file.show(fn)
>>>>>        CBA|>
>>>>>        > file.size(fn)
>>>>>        [1] 5
>>>>>        >
>>>>
>>>> With the patch applied:
>>>>
>>>>        > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>>>        [1] "C"  "B"  "A"  "|"  ">a"
>>>>        > file.size(fn)
>>>>        [1] 22
>>>
>>> That may be okay on Unix, but it's not enough on Windows.  There the \n
>>> that writeLines adds at the end of each line isn't translated to
>>> UTF-16LE properly, so things get messed up.  (I think the \n is
>>> translated, but the \r that Windows wants is not, so you get a mix of 8
>>> bit and 16 bit characters.)
>>
>> That's unfortunate. I tested my tiny patch on Linux. I don't know what
>> kind of additional changes would be needed to make this work on Windows.
>>
>
> It looks like a big change is needed for a perfect solution:
>
>    - Windows does the translation of \n to \r\n.  In the R code, Windows
> is never told that the output is UTF-16LE, so it does an 8 bit translation.
>
>    - Telling Windows that output is UTF-16LE looks hard:  we'd need to
> convert the string to wide chars in R, then write it in wide chars.
> This seems like a lot of work for a rare case.
>
>    - It might be easier to do a hack:  if the user asks for "UTF-16LE",
> then treat it internally as a text file but tell Windows it's a binary
> file.  This means no \n to \r\n translation will be done by Windows.  If
> the desired output file needs Windows line endings, the user would have
> to specify sep="\r\n" in writeLines.
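
The 8 bit translation described in the quoted list can be mimicked in portable C. The sketch below is only a simulation with a hypothetical function name (translate_8bit), not the Windows runtime code: it puts a single 0x0D byte in front of every 0x0A byte, which is roughly what a text-mode layer does when it has not been told the data is UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simulation of an 8-bit text-mode write: every 0x0A byte
 * gets one 0x0D byte inserted in front of it. Returns the number of
 * bytes written to out. */
size_t translate_8bit(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_size)
{
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == 0x0A) {
            assert(n < out_size);
            out[n++] = 0x0D;   /* a lone byte, not a 16-bit CR */
        }
        assert(n < out_size);
        out[n++] = in[i];
    }
    return n;
}
```

Feeding it "A\n" as UTF-16LE (41 00 0A 00) yields 41 00 0D 0A 00: five bytes, so every 16-bit unit after the inserted byte is shifted by one. That is the mix of 8 bit and 16 bit characters mentioned earlier in the thread.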

A third possibility is to handle the insertion of the \r completely 
within R.  This will have the advantage of making it optional, so it 
would be a lot easier to write a Unix-style file on Windows.
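
As a rough illustration of what doing the \r insertion ourselves could mean at the byte level, here is a sketch in C. It is not a proposed change to R's sources: the function name (utf16le_newlines) and the add_cr flag are hypothetical, and it assumes the data has already been converted to UTF-16LE and would be written through a binary-mode file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy UTF-16LE bytes from in to out, inserting a
 * full 16-bit CR (0D 00) before each 16-bit LF (0A 00) when add_cr is
 * nonzero. in_len must be even. Returns the number of bytes written. */
size_t utf16le_newlines(const unsigned char *in, size_t in_len,
                        unsigned char *out, size_t out_size, int add_cr)
{
    size_t n = 0;
    assert(in_len % 2 == 0);
    for (size_t i = 0; i < in_len; i += 2) {
        int is_lf = (in[i] == 0x0A && in[i + 1] == 0x00);
        if (add_cr && is_lf) {
            assert(n + 2 <= out_size);
            out[n++] = 0x0D;
            out[n++] = 0x00;
        }
        assert(n + 2 <= out_size);
        out[n++] = in[i];       /* copy the code unit unchanged */
        out[n++] = in[i + 1];
    }
    return n;
}
```

With add_cr set, 41 00 0A 00 becomes 41 00 0D 00 0A 00; with it cleared, the bytes pass through unchanged, which gives the Unix-style file. The sketch does not check whether a CR is already present, so input that already uses \r\n would get a second CR.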

I think either the first or third possibilities will take too much time 
for me to attempt them before 3.3.0.  I'm not sure about the second one yet.

Duncan Murdoch