[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
Duncan Murdoch
murdoch.duncan at gmail.com
Wed Feb 24 21:49:58 CET 2016
On 24/02/2016 11:16 AM, Duncan Murdoch wrote:
> On 24/02/2016 9:55 AM, Mikko Korpela wrote:
>> On 24.02.2016 15:47, Duncan Murdoch wrote:
>>> On 23/02/2016 7:06 AM, Mikko Korpela wrote:
>>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>>> on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>>
>>>>> > Dear R developers
>>>>> > I think I have found a bug that can be reproduced with two lines of code
>>>>> > and I am very thankful to get your first assessment or feed-back on my
>>>>> > report.
>>>>>
>>>>> > If this is the wrong mailing list or I did something wrong
>>>>> > (e. g. semi "anonymous" email address to protect my privacy and defend
>>>>> > unwanted spam) please let me know since I am new here.
>>>>>
>>>>> > Thank you very much :-)
>>>>>
>>>>> > J. Altfeld
>>>>>
>>>>> Dear J.,
>>>>> (yes, a bit less anonymity would be very welcomed here!),
>>>>>
>>>>> You are right, this is a bug, at least in the documentation, but
>>>>> probably "all real", indeed,
>>>>>
>>>>> but read on.
>>>>>
>>>>> > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>>> >>
>>>>> >>
>>>>> >> If I execute the code from the "?write.table" examples section
>>>>> >>
>>>>> >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>>> >> # (omitted code)
>>>>> >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>>> >>
>>>>> >> the resulting CSV file has a size of 6 bytes which is too short
>>>>> >> (truncated):
>>>>> >>
>>>>> >> """,3
>>>>>
>>>>> reproducibly, yes.
>>>>> If you look at what write.csv does
>>>>> and then simplify, you can get a similar wrong result by
>>>>>
>>>>> write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>>
>>>>> which results in a file with one line
>>>>>
>>>>> """ 3
>>>>>
>>>>> and if you debug write.table() you see that its building blocks
>>>>> here are
>>>>> file <- file(........, encoding = fileEncoding)
>>>>>
>>>>> a writeLines(*, file=file) for the column headers,
>>>>>
>>>>> and then "deeper down" C code which I did not investigate.
>>>>
>>>> I took a look at connections.c. There is a call to strlen() that gets
>>>> confused by null characters. I think the obvious fix is to avoid the
>>>> call to strlen() as the size is already known:
>>>>
>>>> Index: src/main/connections.c
>>>> ===================================================================
>>>> --- src/main/connections.c (revision 70213)
>>>> +++ src/main/connections.c (working copy)
>>>> @@ -369,7 +369,7 @@
>>>> /* is this safe? */
>>>> warning(_("invalid char string in output conversion"));
>>>> *ob = '\0';
>>>> - con->write(outbuf, 1, strlen(outbuf), con);
>>>> + con->write(outbuf, 1, ob - outbuf, con);
>>>> } while(again && inb > 0); /* it seems some iconv signal -1 on zero-length input */
>>>> } else
>>>>
>>>>
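To illustrate the strlen() problem described above: in UTF-16LE every ASCII character is followed by a 0x00 byte, so strlen() stops after the first byte of the converted output. The following is a minimal standalone sketch, not the code in connections.c; the array and function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* "CBA" encoded as UTF-16LE: each ASCII character is followed by a 0x00 byte. */
static const char utf16le_cba[] = { 'C', 0, 'B', 0, 'A', 0 };

/* Length as the unpatched line computes it: strlen() stops at the first 0x00 byte. */
size_t length_by_strlen(const char *outbuf)
{
    return strlen(outbuf);
}

/* Length as the patched line computes it: the distance from the start of
   the buffer to the current write pointer, independent of the contents. */
size_t length_by_pointer(const char *outbuf, const char *ob)
{
    return (size_t)(ob - outbuf);
}
```

For the six-byte buffer, length_by_strlen(utf16le_cba) gives 1, while length_by_pointer(utf16le_cba, utf16le_cba + sizeof utf16le_cba) gives 6. That is consistent with the truncated files reported in this thread, where only the first byte of each converted chunk reaches the file.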
>>>>>
>>>>> But just looking a bit at such a file() object with writeLines()
>>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>>> "work" for this encoding:
>>>>>
>>>>> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>>> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>>> > close(ff)
>>>>> > file.show(fn)
>>>>> CBA|>
>>>>> > file.size(fn)
>>>>> [1] 5
>>>>> >
>>>>
>>>> With the patch applied:
>>>>
>>>> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>>> [1] "C" "B" "A" "|" ">a"
>>>> > file.size(fn)
>>>> [1] 22
>>>
>>> That may be okay on Unix, but it's not enough on Windows. There the \n
>>> that writeLines adds at the end of each line isn't translated to
>>> UTF-16LE properly, so things get messed up. (I think the \n is
>>> translated, but the \r that Windows wants is not, so you get a mix of 8
>>> bit and 16 bit characters.)
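The mix of 8 bit and 16 bit characters can be pictured with a small simulation. This sketch assumes that the text-mode layer simply replaces every 0x0A byte with 0x0D 0x0A without knowing anything about the encoding. It is not the actual Windows runtime code, and bytewise_crlf is a hypothetical name.

```c
#include <stddef.h>

/* Simulates an encoding-unaware text-mode translation: every 0x0A byte
   becomes the two bytes 0x0D 0x0A. The caller must provide room for
   2 * n bytes in `out`. Returns the number of bytes written. */
size_t bytewise_crlf(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x0A)
            out[k++] = 0x0D;   /* single-byte '\r', not a 16-bit code unit */
        out[k++] = in[i];
    }
    return k;
}
```

Feeding it "A\n" in UTF-16LE (41 00 0A 00) produces 41 00 0D 0A 00: five bytes, so every code unit after the inserted 0x0D is shifted by one byte and the stream is no longer valid UTF-16.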
>>
>> That's unfortunate. I tested my tiny patch on Linux. I don't know what
>> kind of additional changes would be needed to make this work on Windows.
>>
>
> It looks like a big change is needed for a perfect solution:
>
> - Windows does the translation of \n to \r\n. In the R code, Windows
> is never told that the output is UTF-16LE, so it does an 8 bit translation.
>
> - Telling Windows that output is UTF-16LE looks hard: we'd need to
> convert the string to wide chars in R, then write it in wide chars.
> This seems like a lot of work for a rare case.
>
> - It might be easier to do a hack: if the user asks for "UTF-16LE",
> then treat it internally as a text file but tell Windows it's a binary
> file. This means no \n to \r\n translation will be done by Windows. If
> the desired output file needs Windows line endings, the user would have
> to specify sep="\r\n" in writeLines.

A third possibility is to handle the insertion of the \r completely
within R. This would have the advantage of making it optional, so it
would be a lot easier to write a Unix-style file on Windows.

I think either the first or the third possibility would take too much
time for me to attempt before 3.3.0. I'm not sure about the second one yet.

Duncan Murdoch
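
As a rough sketch of what handling the \r before the bytes reach the connection could look like, the helper below encodes ASCII input as UTF-16LE and emits a full 16-bit \r in front of each \n only when asked. It is a hypothetical illustration, not a proposal for the actual change in R: the name ascii_to_utf16le and the use_crlf flag are assumptions, and only ASCII input is handled.

```c
#include <stddef.h>

/* Encodes ASCII text as UTF-16LE. If use_crlf is non-zero, each '\n' is
   written as "\r\n", with the '\r' as a complete 16-bit code unit.
   The caller must provide room for 4 * n bytes in `out`.
   Assumes ASCII-only input (bytes >= 0x80 are not handled correctly).
   Returns the number of bytes written. */
size_t ascii_to_utf16le(const char *in, size_t n, int use_crlf,
                        unsigned char *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (use_crlf && in[i] == '\n') {
            out[k++] = 0x0D;   /* low byte of U+000D */
            out[k++] = 0x00;   /* high byte */
        }
        out[k++] = (unsigned char)in[i];   /* low byte of the character */
        out[k++] = 0x00;                   /* high byte is zero for ASCII */
    }
    return k;
}
```

With use_crlf set, "A\n" becomes 41 00 0D 00 0A 00 (six bytes); with it cleared, the result is 41 00 0A 00 (four bytes), i.e. a Unix-style file. Because the line ending is decided before the bytes are written, the file would have to be opened in binary mode so that no further translation happens underneath.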