[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
Mikko Korpela
mikko.korpela at aalto.fi
Mon Feb 29 21:30:41 CET 2016
The file.show() issue is now in the bug tracker. I used a slightly
different example to demonstrate the problem.
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16738
- Mikko
On 29.02.2016 20:30, Duncan Murdoch wrote:
> I have just committed your first patch (the strlen() replacement) to
> R-devel, and will soon put it in R-patched as well. I won't have time to
> look at this again before the 3.2.4 release, so your file.show() patch
> isn't going to make it unless someone else gets to it.
>
> There's still a faint chance that I'll do more in R-devel before 3.3.0,
> but I think it's best if there were bug reports about both of these
> problems so they don't get forgotten. Since the first one is mainly a
> Windows problem, I'll write that one up; I'd appreciate it if you could
> write up the file.show() issue, after checking against R-devel rev 70247
> or higher.
>
> Duncan Murdoch
>
> On 25/02/2016 5:54 AM, Mikko Korpela wrote:
>> On 25.02.2016 11:31, Mikko Korpela wrote:
>>> On 23.02.2016 14:06, Mikko Korpela wrote:
>>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>>> on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>>
>>>>> > Dear R developers
>>>>> > I think I have found a bug that can be reproduced with two lines
>>>>> > of code, and I would be very thankful for your first assessment
>>>>> > of, or feedback on, my report.
>>>>>
>>>>> > If this is the wrong mailing list or I did something wrong
>>>>> > (e.g. a semi-"anonymous" email address to protect my privacy and
>>>>> > defend against unwanted spam), please let me know since I am new here.
>>>>>
>>>>> > Thank you very much :-)
>>>>>
>>>>> > J. Altfeld
>>>>>
>>>>> Dear J.,
>>>>> (yes, a bit less anonymity would be very welcome here!),
>>>>>
>>>>> You are right, this is a bug, at least in the documentation, but
>>>>> probably "all real", indeed,
>>>>>
>>>>> but read on.
>>>>>
>>>>> > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>>> >>
>>>>> >>
>>>>> >> If I execute the code from the "?write.table" examples section
>>>>> >>
>>>>> >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>>> >> # (omitted code)
>>>>> >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>>> >>
>>>>> >> the resulting CSV file has a size of 6 bytes which is too short
>>>>> >> (truncated):
>>>>> >>
>>>>> >> """,3
>>>>>
>>>>> reproducibly, yes.
>>>>> If you look at what write.csv does
>>>>> and then simplify, you can get a similar wrong result by
>>>>>
>>>>> write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>>
>>>>> which results in a file with one line
>>>>>
>>>>> """ 3
>>>>>
>>>>> and if you debug write.table() you see that its building blocks
>>>>> here are
>>>>> file <- file(........, encoding = fileEncoding)
>>>>>
>>>>> a writeLines(*, file=file) for the column headers,
>>>>>
>>>>> and then "deeper down" C code which I did not investigate.
>>>>
>>>> I took a look at connections.c. There is a call to strlen() that gets
>>>> confused by null characters. I think the obvious fix is to avoid the
>>>> call to strlen() as the size is already known:
>>>>
>>>> Index: src/main/connections.c
>>>> ===================================================================
>>>> --- src/main/connections.c (revision 70213)
>>>> +++ src/main/connections.c (working copy)
>>>> @@ -369,7 +369,7 @@
>>>> /* is this safe? */
>>>> warning(_("invalid char string in output conversion"));
>>>> *ob = '\0';
>>>> - con->write(outbuf, 1, strlen(outbuf), con);
>>>> + con->write(outbuf, 1, ob - outbuf, con);
>>>> } while(again && inb > 0); /* it seems some iconv signal -1 on
>>>> zero-length input */
>>>> } else
>>>>
>>>>
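To illustrate why strlen() undercounts here: in UTF-16LE every ASCII character is followed by a zero byte, so strlen() stops after the first byte, whereas the pointer difference ob - outbuf gives the number of bytes actually produced. The following is a minimal standalone C sketch, not the code in connections.c. The helper names (ascii_to_utf16le, length_by_strlen, length_by_pointer) are hypothetical, and the hand-rolled ASCII-only encoder merely stands in for iconv.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption, not R's code): encode an ASCII
   string as UTF-16LE into out. Returns a pointer one past the last
   byte written, playing the role of `ob` in the patch. */
char *ascii_to_utf16le(const char *in, char *out, size_t outsize)
{
    char *ob = out;
    assert(outsize >= 2 * strlen(in) + 1);
    for (; *in != '\0'; in++) {
        *ob++ = *in;   /* low byte: the ASCII code */
        *ob++ = '\0';  /* high byte: always zero for ASCII */
    }
    *ob = '\0';        /* terminator, mirroring `*ob = '\0'` in the diff */
    return ob;
}

/* Length as the unpatched call computed it: strlen() stops at the
   first zero byte, which is the high byte of the first character. */
size_t length_by_strlen(const char *in)
{
    char outbuf[64];
    ascii_to_utf16le(in, outbuf, sizeof outbuf);
    return strlen(outbuf);
}

/* Length as the patched call computes it: the distance between the
   write pointer and the start of the buffer. */
size_t length_by_pointer(const char *in)
{
    char outbuf[64];
    char *ob = ascii_to_utf16le(in, outbuf, sizeof outbuf);
    return (size_t)(ob - outbuf);
}
```

Under these assumptions, length_by_strlen("CBA") yields 1 while length_by_pointer("CBA") yields 6, so a write that relies on strlen() emits only the first byte of the converted buffer.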
>>>>>
>>>>> But just looking a bit at such a file() object with writeLines()
>>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>>> "work" for this encoding:
>>>>>
>>>>> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>>> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>>> > close(ff)
>>>>> > file.show(fn)
>>>>> CBA|>
>>>>> > file.size(fn)
>>>>> [1] 5
>>>>> >
>>>>
>>>> With the patch applied:
>>>>
>>>> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>>> [1] "C" "B" "A" "|" ">a"
>>>> > file.size(fn)
>>>> [1] 22
>>> I just realized that I was misusing the encoding argument of
>>> readLines(). The code above works by accident, but the following would
>>> be more appropriate:
>>>
>>> > ff <- file(fn, open="r", encoding="UTF-16LE")
>>> > readLines(ff)
>>> [1] "C" "B" "A" "|" ">a"
>>> > close(ff)
>>>
>>> Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
>>> the patch is incomplete on Windows.)
>> Before inspecting the file with readLines() I tried file.show() but it
>> did not work as expected. On Linux using a UTF-8 locale, the result of
>> trying to show the truly UTF-16LE encoded file with
>>
>> > file.show(fn, encoding="UTF-16LE")
>>
>> was a pager showing "<43>" (quotes not included) followed by several
>> empty lines.
>>
>> With the following patch, the command works correctly (in this case, on
>> this platform, not tested comprehensively). The idea is to read the
>> input file "raw" in order to avoid problems with null characters. The
>> input then needs to be split into lines after iconv(), or it could be
>> written to the output file with cat() if the style of line termination
>> characters does not matter. The 'perl = TRUE' is for an assumed performance
>> advantage only. It can be removed, or one might want to test if there is
>> a significant difference one way or the other.
>>
>> - Mikko
>>
>> Index: src/library/base/R/files.R
>> ===================================================================
>> --- src/library/base/R/files.R (revision 70217)
>> +++ src/library/base/R/files.R (working copy)
>> @@ -50,10 +50,13 @@
>> for(i in seq_along(files)) {
>> f <- files[i]
>> tf <- tempfile()
>> - tmp <- readLines(f, warn = FALSE)
>> + tmp <- list(readBin(f, "raw", file.size(f)))
>> tmp2 <- try(iconv(tmp, encoding, "", "byte"))
>> if(inherits(tmp2, "try-error")) file.copy(f, tf)
>> - else writeLines(tmp2, tf)
>> + else {
>> + tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
>> + writeLines(tmp2, tf)
>> + }
>> files[i] <- tf
>> if(delete.file) unlink(f)
>> }
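The order of operations in the file.show() patch quoted above (take the raw bytes with a known length, convert the whole buffer, and only then split into lines) can be sketched in C as well. This is an illustration under stated assumptions, not a translation of files.R. The function names are hypothetical, and decode_utf16le_ascii is a hand-rolled, ASCII-only stand-in for iconv(). The splitter follows the pattern "\r\n?|\n" used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for iconv(): decode UTF-16LE bytes that hold
   only ASCII characters into a C string. The length is passed in
   explicitly, so the zero high bytes are not mistaken for terminators.
   Returns the number of characters written, or (size_t)-1 on error. */
size_t decode_utf16le_ascii(const unsigned char *raw, size_t nbytes,
                            char *out, size_t outsize)
{
    size_t i, k = 0;
    if (nbytes % 2 != 0 || outsize < nbytes / 2 + 1)
        return (size_t)-1;
    for (i = 0; i < nbytes; i += 2) {
        /* accept only non-NUL ASCII code units */
        if (raw[i + 1] != 0 || raw[i] == 0 || raw[i] > 0x7F)
            return (size_t)-1;
        out[k++] = (char)raw[i];
    }
    out[k] = '\0';
    return k;
}

/* Split the decoded text in place on "\r\n", "\r" or "\n".
   Stores a pointer to each line in lines[] and returns the count. */
size_t split_lines(char *text, char **lines, size_t max_lines)
{
    size_t count = 0;
    char *p = text;
    assert(text != NULL && lines != NULL);
    while (*p != '\0' && count < max_lines) {
        char term;
        lines[count++] = p;
        p += strcspn(p, "\r\n");   /* advance to the next terminator */
        if (*p == '\0')
            break;                 /* last line had no terminator */
        term = *p;
        *p++ = '\0';               /* cut the line here */
        if (term == '\r' && *p == '\n')
            p++;                   /* treat "\r\n" as one terminator */
    }
    return count;
}
```

Given the 16 UTF-16LE bytes of "C\nB\r\n>a\n", the decoder returns 8 characters and the splitter yields the three lines "C", "B" and ">a". Splitting before decoding would not work, because the line terminators are themselves two-byte units in UTF-16LE.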