[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
Duncan Murdoch
murdoch.duncan at gmail.com
Wed Feb 24 17:16:12 CET 2016
On 24/02/2016 9:55 AM, Mikko Korpela wrote:
> On 24.02.2016 15:47, Duncan Murdoch wrote:
>> On 23/02/2016 7:06 AM, Mikko Korpela wrote:
>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>> on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>
>>>> > Dear R developers
>>>> > I think I have found a bug that can be reproduced with two lines of code
>>>> > and I am very thankful to get your first assessment or feed-back on my
>>>> > report.
>>>>
>>>> > If this is the wrong mailing list or I did something wrong
>>>> > (e. g. semi "anonymous" email address to protect my privacy and defend
>>>> > unwanted spam) please let me know since I am new here.
>>>>
>>>> > Thank you very much :-)
>>>>
>>>> > J. Altfeld
>>>>
>>>> Dear J.,
>>>> (yes, a bit less anonymity would be very welcomed here!),
>>>>
>>>> You are right, this is a bug, at least in the documentation, but
>>>> probably "all real", indeed,
>>>>
>>>> but read on.
>>>>
>>>> > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>> >>
>>>> >>
>>>> >> If I execute the code from the "?write.table" examples section
>>>> >>
>>>> >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>> >> # (omitted code)
>>>> >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>> >>
>>>> >> the resulting CSV file has a size of 6 bytes which is too short
>>>> >> (truncated):
>>>> >>
>>>> >> """,3
>>>>
>>>> reproducibly, yes.
>>>> If you look at what write.csv does
>>>> and then simplify, you can get a similar wrong result by
>>>>
>>>> write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>
>>>> which results in a file with one line
>>>>
>>>> """ 3
>>>>
>>>> and if you debug write.table() you see that its building blocks
>>>> here are
>>>> file <- file(........, encoding = fileEncoding)
>>>>
>>>> a writeLines(*, file=file) for the column headers,
>>>>
>>>> and then "deeper down" C code which I did not investigate.
>>>
>>> I took a look at connections.c. There is a call to strlen() that gets
>>> confused by null characters. I think the obvious fix is to avoid the
>>> call to strlen() as the size is already known:
>>>
>>> Index: src/main/connections.c
>>> ===================================================================
>>> --- src/main/connections.c (revision 70213)
>>> +++ src/main/connections.c (working copy)
>>> @@ -369,7 +369,7 @@
>>> /* is this safe? */
>>> warning(_("invalid char string in output conversion"));
>>> *ob = '\0';
>>> - con->write(outbuf, 1, strlen(outbuf), con);
>>> + con->write(outbuf, 1, ob - outbuf, con);
>>> } while(again && inb > 0); /* it seems some iconv signal -1 on
>>> zero-length input */
>>> } else
>>>
>>>
>>>>
>>>> But just looking a bit at such a file() object with writeLines()
>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>> "work" for this encoding:
>>>>
>>>> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>> > close(ff)
>>>> > file.show(fn)
>>>> CBA|>
>>>> > file.size(fn)
>>>> [1] 5
>>>> >
>>>
>>> With the patch applied:
>>>
>>> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>> [1] "C" "B" "A" "|" ">a"
>>> > file.size(fn)
>>> [1] 22
>>
>> That may be okay on Unix, but it's not enough on Windows. There the \n
>> that writeLines adds at the end of each line isn't translated to
>> UTF-16LE properly, so things get messed up. (I think the \n is
>> translated, but the \r that Windows wants is not, so you get a mix of 8
>> bit and 16 bit characters.)
>
> That's unfortunate. I tested my tiny patch on Linux. I don't know what
> kind of additional changes would be needed to make this work on Windows.
>
It looks like a big change is needed for a perfect solution:

 - Windows does the translation of \n to \r\n. In the R code, Windows
   is never told that the output is UTF-16LE, so it does an 8-bit
   translation.

 - Telling Windows that output is UTF-16LE looks hard: we'd need to
   convert the string to wide chars in R, then write it in wide chars.
   This seems like a lot of work for a rare case.

 - It might be easier to do a hack: if the user asks for "UTF-16LE",
   then treat it internally as a text file but tell Windows it's a binary
   file. This means no \n to \r\n translation will be done by Windows. If
   the desired output file needs Windows line endings, the user would have
   to specify sep="\r\n" in writeLines.
Duncan Murdoch