[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
Duncan Murdoch
murdoch.duncan at gmail.com
Wed Feb 24 00:25:09 CET 2016
On 23/02/2016 4:53 PM, nospam at altfeld-im.de wrote:
> Excellent analysis, thank you both for the quick reply!
>
> Is there anything I can do to get the bug fixed in the next version of R
> (e. g. filing a bug report at https://bugs.r-project.org/bugzilla3/)?
Wait a few days, and file a bug report if nothing has happened.
Duncan Murdoch
>
>
> On Tue, 2016-02-23 at 14:06 +0200, Mikko Korpela wrote:
>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>> on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>
>>> > Dear R developers
>>> > I think I have found a bug that can be reproduced with two lines of code
>>> > and I would be very thankful for your first assessment or feedback on my
>>> > report.
>>>
>>> > If this is the wrong mailing list or I did something wrong
>>> > (e. g. semi "anonymous" email address to protect my privacy and defend
>>> > against unwanted spam) please let me know since I am new here.
>>>
>>> > Thank you very much :-)
>>>
>>> > J. Altfeld
>>>
>>> Dear J.,
>>> (yes, a bit less anonymity would be very welcomed here!),
>>>
>>> You are right, this is a bug, at least in the documentation, but
>>> probably "all real", indeed,
>>>
>>> but read on.
>>>
>>> > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>> >>
>>> >>
>>> >> If I execute the code from the "?write.table" examples section
>>> >>
>>> >> x <- data.frame(a = I("a \" quote"), b = pi)
>>> >> # (omitted code)
>>> >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>> >>
>>> >> the resulting CSV file has a size of 6 bytes which is too short
>>> >> (truncated):
>>> >>
>>> >> """,3
>>>
>>> reproducibly, yes.
>>> If you look at what write.csv does
>>> and then simplify, you can get a similar wrong result by
>>>
>>> write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>
>>> which results in a file with one line
>>>
>>> """ 3
>>>
>>> and if you debug write.table() you see that its building blocks
>>> here are
>>> file <- file(........, encoding = fileEncoding)
>>>
>>> a writeLines(*, file=file) for the column headers,
>>>
>>> and then "deeper down" C code which I did not investigate.
>>
>> I took a look at connections.c. There is a call to strlen() that gets
>> confused by null characters. I think the obvious fix is to avoid the
>> call to strlen() as the size is already known:
>>
>> Index: src/main/connections.c
>> ===================================================================
>> --- src/main/connections.c (revision 70213)
>> +++ src/main/connections.c (working copy)
>> @@ -369,7 +369,7 @@
>> /* is this safe? */
>> warning(_("invalid char string in output conversion"));
>> *ob = '\0';
>> - con->write(outbuf, 1, strlen(outbuf), con);
>> + con->write(outbuf, 1, ob - outbuf, con);
>> } while(again && inb > 0); /* it seems some iconv signal -1 on
>> zero-length input */
>> } else
>>
>>
>>>
>>> But just looking a bit at such a file() object with writeLines()
>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>> "work" for this encoding:
>>>
>>> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>> > close(ff)
>>> > file.show(fn)
>>> CBA|>
>>> > file.size(fn)
>>> [1] 5
>>> >
>>
>> With the patch applied:
>>
>> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>> [1] "C" "B" "A" "|" ">a"
>> > file.size(fn)
>> [1] 22
>>
>> - Mikko Korpela
>>
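The byte counts quoted above (5 before the patch, 22 after) can be reproduced with a small standalone C sketch. It is not the code in connections.c: ascii_to_utf16le() is a hypothetical stand-in for the iconv conversion that handles ASCII input only, and the sketch assumes each line reaches the connection as its own chunk with a trailing "\n". Only the two ways of measuring the output buffer, strlen(outbuf) versus ob - outbuf, are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the iconv step: encode an ASCII string as
 * UTF-16LE (each character followed by a 0x00 byte). Returns a pointer
 * one past the last byte written, playing the role of `ob` in the patch. */
char *ascii_to_utf16le(const char *in, char *outbuf, size_t cap)
{
    char *ob = outbuf;
    size_t n = strlen(in);

    assert(2 * n + 1 <= cap);   /* room for the bytes plus the terminator */
    for (size_t i = 0; i < n; i++) {
        *ob++ = in[i];          /* low byte */
        *ob++ = '\0';           /* high byte: an embedded null */
    }
    *ob = '\0';                 /* mirrors `*ob = '\0';` in the patch context */
    return ob;
}

/* Length as measured before the patch: stops at the first null byte. */
size_t bytes_by_strlen(const char *in)
{
    char outbuf[64];
    ascii_to_utf16le(in, outbuf, sizeof outbuf);
    return strlen(outbuf);
}

/* Length as measured after the patch: distance from buffer start to `ob`. */
size_t bytes_by_pointer(const char *in)
{
    char outbuf[64];
    char *ob = ascii_to_utf16le(in, outbuf, sizeof outbuf);
    return (size_t)(ob - outbuf);
}

/* Sum the bytes that would be written for a sequence of chunks. */
size_t total_bytes(const char *const *chunks, size_t n, int use_strlen)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += use_strlen ? bytes_by_strlen(chunks[i])
                            : bytes_by_pointer(chunks[i]);
    return total;
}
```

Under those assumptions, the chunks "C\n", "B\n", "A\n", "|\n" and ">a\n" add up to 5 bytes when measured with strlen() (one byte per chunk, which matches the "CBA|>" file content) and to 22 bytes when measured as ob - outbuf.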
>>> >> The problem seems to be the iconv function:
>>> >>
>>> >> iconv("foo", to="UTF-16")
>>> >>
>>> >> produces
>>> >>
>>> >> Error in iconv("foo", to = "UTF-16"):
>>> >> embedded nul in string: '\xff\xfef\0o\0o\0'
>>>
>>> but this works
>>>
>>> > iconv("foo", to="UTF-16", toRaw=TRUE)
>>> [[1]]
>>> [1] ff fe 66 00 6f 00 6f 00
>>>
>>> (indeed showing the embedded '\0's)
>>>
>>> >> In 2010 a (partial) patch for this problem was submitted:
>>> >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html
>>>
>>> the patch only related to the iconv() problem not allowing 'raw'
>>> (instead of character) argument x.
>>>
>>> ... and it is > 5.5 years old, for an iconv() version that was less
>>> featureful than today.
>>> Rather, current iconv(x) allows x to be a list of raw entries.
>>>
>>>
>>> >> Are there chances to fix this problem since it prevents writing Windows
>>> >> UTF-16LE text files?
>>>
>>> >>
>>> >> PS: This problem can be reproduced on Windows and Linux.
>>>
>>> indeed.... also on "R devel of today".
>>>
>>> I agree it should be fixed... but as I said not by the patch you
>>> mentioned.
>>>
>>> Tested patches to fix this are welcome, indeed.
>>
>