[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Mon Feb 29 19:30:13 CET 2016

I have just committed your first patch (the strlen() replacement) to 
R-devel, and will soon put it in R-patched as well.  I won't have time to 
look at this again before the 3.2.4 release, so your file.show() patch 
isn't going to make it unless someone else gets to it.

There's still a faint chance that I'll do more in R-devel before 3.3.0, 
but I think it would be best to have bug reports about both of these 
problems so they don't get forgotten.  Since the first one is mainly a 
Windows problem, I'll write that one up; I'd appreciate it if you could 
write up the file.show() issue, after checking against R-devel rev 70247 
or higher.

Duncan Murdoch

On 25/02/2016 5:54 AM, Mikko Korpela wrote:
> On 25.02.2016 11:31, Mikko Korpela wrote:
>> On 23.02.2016 14:06, Mikko Korpela wrote:
>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>>      on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>
>>>>      > Dear R developers
>>>>      > I think I have found a bug that can be reproduced with two lines of code
>>>>      > and I am very thankful to get your first assessment or feed-back on my
>>>>      > report.
>>>>
>>>>      > If this is the wrong mailing list or I did something wrong
>>>>      > (e. g. semi "anonymous" email address to protect my privacy and defend
>>>>      > against unwanted spam) please let me know since I am new here.
>>>>
>>>>      > Thank you very much :-)
>>>>
>>>>      > J. Altfeld
>>>>
>>>> Dear J.,
>>>> (yes, a bit less anonymity would be very welcome here!),
>>>>
>>>> You are right, this is a bug, at least in the documentation, but
>>>> probably "all real", indeed,
>>>>
>>>> but read on.
>>>>
>>>>      > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>>      >>
>>>>      >>
>>>>      >> If I execute the code from the "?write.table" examples section
>>>>      >>
>>>>      >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>>      >> # (omitted code)
>>>>      >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>>      >>
>>>>      >> the resulting CSV file has a size of 6 bytes which is too short
>>>>      >> (truncated):
>>>>      >>
>>>>      >> """,3
>>>>
>>>> reproducibly, yes.
>>>> If you look at what write.csv does
>>>> and then simplify, you can get a similar wrong result by
>>>>
>>>>    write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>
>>>> which results in a file with one line
>>>>
>>>> """ 3
>>>>
>>>> and if you debug  write.table() you see that its building blocks
>>>> here are
>>>> 	 file <- file(........, encoding = fileEncoding)
>>>>
>>>> a 	 writeLines(*, file=file)  for the column headers,
>>>>
>>>> and then "deeper down" C code which I did not investigate.
>>>
>>> I took a look at connections.c. There is a call to strlen() that gets
>>> confused by null characters. I think the obvious fix is to avoid the
>>> call to strlen() as the size is already known:
>>>
>>> Index: src/main/connections.c
>>> ===================================================================
>>> --- src/main/connections.c	(revision 70213)
>>> +++ src/main/connections.c	(working copy)
>>> @@ -369,7 +369,7 @@
>>>   		/* is this safe? */
>>>   		warning(_("invalid char string in output conversion"));
>>>   	    *ob = '\0';
>>> -	    con->write(outbuf, 1, strlen(outbuf), con);
>>> +	    con->write(outbuf, 1, ob - outbuf, con);
>>>   	} while(again && inb > 0);  /* it seems some iconv signal -1 on
>>>   				       zero-length input */
>>>       } else
>>>
>>>
>>>>
>>>> But just looking a bit at such a file() object with writeLines()
>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>> "work" for this encoding:
>>>>
>>>>      > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>>      > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>>      > close(ff)
>>>>      > file.show(fn)
>>>>      CBA|>
>>>>      > file.size(fn)
>>>>      [1] 5
>>>>      >
>>>
>>> With the patch applied:
>>>
>>>      > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>>      [1] "C"  "B"  "A"  "|"  ">a"
>>>      > file.size(fn)
>>>      [1] 22
>> I just realized that I was misusing the encoding argument of
>> readLines(). The code above works by accident, but the following would
>> be more appropriate:
>>
>>      > ff <- file(fn, open="r", encoding="UTF-16LE")
>>      > readLines(ff)
>>      [1] "C"  "B"  "A"  "|"  ">a"
>>      > close(ff)
>>
>> Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
>> the patch is incomplete on Windows.)
> Before inspecting the file with readLines() I tried file.show() but it
> did not work as expected. On Linux using a UTF-8 locale, the result of
> trying to show the truly UTF-16LE encoded file with
>
>      > file.show(fn, encoding="UTF-16LE")
>
> was a pager showing "<43>" (quotes not included) followed by several
> empty lines.
>
> With the following patch, the command works correctly (in this case, on
> this platform, not tested comprehensively). The idea is to read the
> input file "raw" in order to avoid problems with null characters. The
> input then needs to be split into lines after iconv(), or it could be
> written to the output file with cat() if the style of line termination
> characters does not matter. The 'perl = TRUE' is there only for an assumed
> performance advantage. It can be removed, or one might want to test whether
> there is a significant difference one way or the other.
>
> - Mikko
>
> Index: src/library/base/R/files.R
> ===================================================================
> --- src/library/base/R/files.R	(revision 70217)
> +++ src/library/base/R/files.R	(working copy)
> @@ -50,10 +50,13 @@
>           for(i in seq_along(files)) {
>               f <- files[i]
>               tf <- tempfile()
> -            tmp <- readLines(f, warn = FALSE)
> +            tmp <- list(readBin(f, "raw", file.size(f)))
>               tmp2 <- try(iconv(tmp, encoding, "", "byte"))
>               if(inherits(tmp2, "try-error")) file.copy(f, tf)
> -            else writeLines(tmp2, tf)
> +            else {
> +                tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
> +                writeLines(tmp2, tf)
> +            }
>               files[i] <- tf
>               if(delete.file) unlink(f)
>           }
>
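
For readers who want to see the strlen() problem from the connections.c
patch in isolation, here is a minimal sketch in C. It is not code from R:
the helper ascii_to_utf16le() and the two bytes_written_*() functions are
hypothetical names invented for illustration, and the hand-rolled encoder
only handles ASCII input (R itself uses iconv). The sketch assumes that
each output line, together with its "\n" terminator, is converted and
written separately, which is consistent with the file sizes reported in
the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OUTBUF_SIZE 64

/* Hypothetical stand-in for the iconv step: encode an ASCII string as
   UTF-16LE into outbuf and NUL-terminate it, mirroring "*ob = '\0'" in
   the patched code. Returns ob - outbuf, the true number of bytes. */
static size_t ascii_to_utf16le(const char *s, char *outbuf)
{
    char *ob = outbuf;

    assert(2 * strlen(s) + 1 <= OUTBUF_SIZE);
    for (; *s != '\0'; s++) {
        *ob++ = *s;    /* low byte of the code unit */
        *ob++ = '\0';  /* high byte is zero for ASCII */
    }
    *ob = '\0';
    return (size_t)(ob - outbuf);
}

/* Total bytes written when the length is taken from strlen(outbuf),
   as before the patch: counting stops at the first zero byte. */
size_t bytes_written_strlen(const char *const *lines, size_t n)
{
    char outbuf[OUTBUF_SIZE];
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        ascii_to_utf16le(lines[i], outbuf);
        total += strlen(outbuf);
    }
    return total;
}

/* Total bytes written when the length is taken from ob - outbuf,
   as in the patch. */
size_t bytes_written_pointer(const char *const *lines, size_t n)
{
    char outbuf[OUTBUF_SIZE];
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += ascii_to_utf16le(lines[i], outbuf);
    return total;
}
```

Called on the five lines "C\n", "B\n", "A\n", "|\n" and ">a\n", the
strlen() variant counts 5 bytes (one per line, since the second byte of
every ASCII character in UTF-16LE is zero), while the pointer-difference
variant counts 22. These match the file.size() values of 5 and 22 shown
above for the unpatched and patched builds.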