[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
Duncan Murdoch
murdoch.duncan at gmail.com
Mon Feb 29 19:30:13 CET 2016
I have just committed your first patch (the strlen() replacement) to
R-devel, and will soon put it in R-patched as well. I won't have time to
look at this again before the 3.2.4 release, so your file.show() patch
isn't going to make it unless someone else gets to it.
There's still a faint chance that I'll do more in R-devel before 3.3.0,
but I think it would be best if there were bug reports about both of these
problems so they don't get forgotten. Since the first one is mainly a
Windows problem, I'll write that one up; I'd appreciate it if you could
write up the file.show() issue, after checking against R-devel rev 70247
or higher.
Duncan Murdoch
On 25/02/2016 5:54 AM, Mikko Korpela wrote:
> On 25.02.2016 11:31, Mikko Korpela wrote:
>> On 23.02.2016 14:06, Mikko Korpela wrote:
>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>> on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>
>>>> > Dear R developers
>>>> > I think I have found a bug that can be reproduced with two lines of code
>>>> > and I am very thankful to get your first assessment or feed-back on my
>>>> > report.
>>>>
>>>> > If this is the wrong mailing list or I did something wrong
>>>> > (e.g. a semi-"anonymous" email address to protect my privacy and defend
>>>> > against unwanted spam) please let me know since I am new here.
>>>>
>>>> > Thank you very much :-)
>>>>
>>>> > J. Altfeld
>>>>
>>>> Dear J.,
>>>> (yes, a bit less anonymity would be very welcome here!),
>>>>
>>>> You are right, this is a bug, at least in the documentation, but
>>>> probably "all real", indeed,
>>>>
>>>> but read on.
>>>>
>>>> > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>> >>
>>>> >>
>>>> >> If I execute the code from the "?write.table" examples section
>>>> >>
>>>> >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>> >> # (omitted code)
>>>> >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>> >>
>>>> >> the resulting CSV file has a size of 6 bytes which is too short
>>>> >> (truncated):
>>>> >>
>>>> >> """,3
>>>>
>>>> reproducibly, yes.
>>>> If you look at what write.csv does
>>>> and then simplify, you can get a similar wrong result by
>>>>
>>>> write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>
>>>> which results in a file with one line
>>>>
>>>> """ 3
>>>>
>>>> and if you debug write.table() you see that its building blocks
>>>> here are
>>>> file <- file(........, encoding = fileEncoding)
>>>>
>>>> a writeLines(*, file=file) for the column headers,
>>>>
>>>> and then "deeper down" C code which I did not investigate.
>>>
>>> I took a look at connections.c. There is a call to strlen() that gets
>>> confused by null characters. I think the obvious fix is to avoid the
>>> call to strlen() as the size is already known:
>>>
>>> Index: src/main/connections.c
>>> ===================================================================
>>> --- src/main/connections.c (revision 70213)
>>> +++ src/main/connections.c (working copy)
>>> @@ -369,7 +369,7 @@
>>> /* is this safe? */
>>> warning(_("invalid char string in output conversion"));
>>> *ob = '\0';
>>> - con->write(outbuf, 1, strlen(outbuf), con);
>>> + con->write(outbuf, 1, ob - outbuf, con);
>>> } while(again && inb > 0); /* it seems some iconv signal -1 on
>>> zero-length input */
>>> } else
>>>
>>>
>>>>
>>>> But just looking a bit at such a file() object with writeLines()
>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>> "work" for this encoding:
>>>>
>>>> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>> > close(ff)
>>>> > file.show(fn)
>>>> CBA|>
>>>> > file.size(fn)
>>>> [1] 5
>>>> >
>>>
>>> With the patch applied:
>>>
>>> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>> [1] "C" "B" "A" "|" ">a"
>>> > file.size(fn)
>>> [1] 22
>> I just realized that I was misusing the encoding argument of
>> readLines(). The code above works by accident, but the following would
>> be more appropriate:
>>
>> > ff <- file(fn, open="r", encoding="UTF-16LE")
>> > readLines(ff)
>> [1] "C" "B" "A" "|" ">a"
>> > close(ff)
>>
>> Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
>> the patch is incomplete on Windows.)
> Before inspecting the file with readLines() I tried file.show() but it
> did not work as expected. On Linux using a UTF-8 locale, the result of
> trying to show the truly UTF-16LE encoded file with
>
> > file.show(fn, encoding="UTF-16LE")
>
> was a pager showing "<43>" (quotes not included) followed by several
> empty lines.
>
> With the following patch, the command works correctly (in this case, on
> this platform, not tested comprehensively). The idea is to read the
> input file "raw" in order to avoid problems with null characters. The
> input then needs to be split into lines after iconv(), or it could be
> written to the output file with cat() if the style of line termination
> characters does not matter. The 'perl = TRUE' is only there for an assumed
> performance advantage. It can be removed, or one might want to test if there is
> a significant difference one way or the other.
>
> - Mikko
>
> Index: src/library/base/R/files.R
> ===================================================================
> --- src/library/base/R/files.R (revision 70217)
> +++ src/library/base/R/files.R (working copy)
> @@ -50,10 +50,13 @@
> for(i in seq_along(files)) {
> f <- files[i]
> tf <- tempfile()
> - tmp <- readLines(f, warn = FALSE)
> + tmp <- list(readBin(f, "raw", file.size(f)))
> tmp2 <- try(iconv(tmp, encoding, "", "byte"))
> if(inherits(tmp2, "try-error")) file.copy(f, tf)
> - else writeLines(tmp2, tf)
> + else {
> + tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
> + writeLines(tmp2, tf)
> + }
> files[i] <- tf
> if(delete.file) unlink(f)
> }
>
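The two problems discussed in this thread can be reproduced at the R level without touching connections.c or files.R. The following is a minimal sketch, not the committed code: the names txt, bytes, n_before_nul, fn, raw_in, decoded and lines are made up for illustration, and the input is the same five lines used in the writeLines() example above. It shows that UTF-16LE output contains a nul byte right after the first ASCII character (which is what confuses strlen()), and that reading the file as raw bytes before iconv() and strsplit(), as in the proposed file.show() patch, recovers the lines.

```r
# Illustrative sketch only; all variable names here are hypothetical.
txt <- c("C", "B", "A", "|", ">a")

# Convert the newline-terminated text to UTF-16LE as raw bytes.
# iconv(toRaw = TRUE) returns a list of raw vectors, so embedded nuls survive.
bytes <- iconv(paste0(txt, "\n", collapse = ""),
               from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1L]]

# Number of bytes a C strlen() call would count: everything before the
# first nul byte. Each ASCII character is followed by 0x00 in UTF-16LE.
n_before_nul <- which(bytes == as.raw(0))[1L] - 1L

# Write the bytes unchanged, then read them back "raw", as the
# file.show() patch does, to avoid problems with null characters.
fn <- tempfile("ffoo")
writeBin(bytes, fn)
raw_in <- readBin(fn, "raw", file.size(fn))

# Decode the whole buffer at once, then split into lines afterwards.
decoded <- iconv(list(raw_in), from = "UTF-16LE", to = "UTF-8")
lines <- strsplit(decoded, "\r\n?|\n", perl = TRUE)[[1L]]

print(length(bytes))
print(n_before_nul)
print(lines)
unlink(fn)
```

The length of bytes corresponds to the file size seen with the strlen() patch applied (11 characters at 2 bytes each), while n_before_nul corresponds to the single byte per write that reached the file before the patch. Splitting after decoding, rather than calling readLines() on the undecoded file, is the design choice taken from the patch above.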