[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Mikko Korpela mikko.korpela at aalto.fi
Thu Feb 25 11:54:56 CET 2016


On 25.02.2016 11:31, Mikko Korpela wrote:
> On 23.02.2016 14:06, Mikko Korpela wrote:
>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>
>>>     > Dear R developers
>>>     > I think I have found a bug that can be reproduced with two lines of code
>>>     > and I am very thankful to get your first assessment or feed-back on my
>>>     > report.
>>>
>>>     > If this is the wrong mailing list or I did something wrong
>>>     > (e.g. semi "anonymous" email address to protect my privacy and defend
>>>     > against unwanted spam) please let me know since I am new here.
>>>
>>>     > Thank you very much :-)
>>>
>>>     > J. Altfeld
>>>
>>> Dear J.,
>>> (yes, a bit less anonymity would be very welcome here!),
>>>
>>> You are right, this is a bug, at least in the documentation, but
>>> probably "all real", indeed,
>>>
>>> but read on.
>>>
>>>     > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>     >> 
>>>     >> 
>>>     >> If I execute the code from the "?write.table" examples section
>>>     >> 
>>>     >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>     >> # (omitted code)
>>>     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>     >> 
>>>     >> the resulting CSV file has a size of 6 bytes which is too short
>>>     >> (truncated):
>>>     >> 
>>>     >> """,3
>>>
>>> reproducibly, yes.
>>> If you look at what write.csv does
>>> and then simplify, you can get a similar wrong result by
>>>
>>>   write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>
>>> which results in a file with one line
>>>
>>> """ 3
>>>
>>> and if you debug  write.table() you see that its building blocks
>>> here are
>>> 	 file <- file(........, encoding = fileEncoding)
>>>
>>> a 	 writeLines(*, file=file)  for the column headers,
>>>
>>> and then "deeper down" C code which I did not investigate.
>>
>> I took a look at connections.c. There is a call to strlen() that gets
>> confused by null characters. I think the obvious fix is to avoid the
>> call to strlen() as the size is already known:
>>
>> Index: src/main/connections.c
>> ===================================================================
>> --- src/main/connections.c	(revision 70213)
>> +++ src/main/connections.c	(working copy)
>> @@ -369,7 +369,7 @@
>>  		/* is this safe? */
>>  		warning(_("invalid char string in output conversion"));
>>  	    *ob = '\0';
>> -	    con->write(outbuf, 1, strlen(outbuf), con);
>> +	    con->write(outbuf, 1, ob - outbuf, con);
>>  	} while(again && inb > 0);  /* it seems some iconv signal -1 on
>>  				       zero-length input */
>>      } else
>>
>>
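To make the strlen() problem concrete, here is a minimal standalone C
sketch. It is not R's actual code: fill_outbuf(), len_by_strlen() and
len_by_pointer() are hypothetical names introduced only for this
illustration, and the buffer is filled by hand instead of by iconv. It
only shows why a length taken from strlen() is too short for UTF-16LE
output, while the pointer difference used in the patch is not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "CBA" encoded as UTF-16LE: every second byte is 0x00. */
static const char utf16le_cba[] = { 'C', 0, 'B', 0, 'A', 0 };

/* Hypothetical stand-in for the conversion step: copies n_in already
   converted bytes into outbuf (capacity cap), terminates the buffer the
   same way the patched code does (*ob = '\0'), and returns ob, which
   points just past the last converted byte. */
static char *fill_outbuf(char *outbuf, size_t cap, const char *in, size_t n_in)
{
    char *ob = outbuf;
    assert(n_in < cap);   /* leave room for the terminator */
    memcpy(ob, in, n_in);
    ob += n_in;
    *ob = '\0';
    return ob;
}

/* Length as computed before the patch: stops at the first zero byte. */
static size_t len_by_strlen(const char *outbuf)
{
    return strlen(outbuf);
}

/* Length as computed after the patch: independent of the buffer contents. */
static size_t len_by_pointer(const char *outbuf, const char *ob)
{
    return (size_t)(ob - outbuf);
}
```

Calling fill_outbuf() with utf16le_cba and a 16-byte buffer, len_by_strlen()
reports 1 byte, because the high byte of 'C' is already a zero, while
len_by_pointer() reports all 6 bytes.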
>>>
>>> But just looking a bit at such a file() object with writeLines()
>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>> "work" for this encoding:
>>>
>>>     > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>     > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>     > close(ff)
>>>     > file.show(fn)
>>>     CBA|>
>>>     > file.size(fn)
>>>     [1] 5
>>>     > 
>>
>> With the patch applied:
>>
>>     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>     [1] "C"  "B"  "A"  "|"  ">a"
>>     > file.size(fn)
>>     [1] 22
> I just realized that I was misusing the encoding argument of
> readLines(). The code above works by accident, but the following would
> be more appropriate:
> 
>     > ff <- file(fn, open="r", encoding="UTF-16LE")
>     > readLines(ff)
>     [1] "C"  "B"  "A"  "|"  ">a"
>     > close(ff)
> 
> Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
> the patch is incomplete on Windows.)

Before inspecting the file with readLines() I tried file.show() but it
did not work as expected. On Linux using a UTF-8 locale, the result of
trying to show the truly UTF-16LE encoded file with

    > file.show(fn, encoding="UTF-16LE")

was a pager showing "<43>" (quotes not included) followed by several
empty lines.

With the following patch, the command works correctly (in this case, on
this platform, not tested comprehensively). The idea is to read the
input file "raw" in order to avoid problems with null characters. The
input then needs to be split into lines after iconv(), or it could be
written to the output file with cat() if the style of line termination
characters does not matter. The 'perl = TRUE' argument is there only for
an assumed performance advantage. It can be removed, or one might want to
test whether it makes a significant difference one way or the other.

- Mikko

Index: src/library/base/R/files.R
===================================================================
--- src/library/base/R/files.R	(revision 70217)
+++ src/library/base/R/files.R	(working copy)
@@ -50,10 +50,13 @@
         for(i in seq_along(files)) {
             f <- files[i]
             tf <- tempfile()
-            tmp <- readLines(f, warn = FALSE)
+            tmp <- list(readBin(f, "raw", file.size(f)))
             tmp2 <- try(iconv(tmp, encoding, "", "byte"))
             if(inherits(tmp2, "try-error")) file.copy(f, tf)
-            else writeLines(tmp2, tf)
+            else {
+                tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
+                writeLines(tmp2, tf)
+            }
             files[i] <- tf
             if(delete.file) unlink(f)
         }


