[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32?

Jack Kelley Jack.Kelley at bigpond.com
Sun Apr 30 01:53:02 CEST 2017


"R version 3.4.0 (2017-04-21)"  on "x86_64-w64-mingw32" platform

I am using CSVs and other text tables, and text in general (including
regular expressions), on Windows 10.
For me, that means dealing with Windows-1252 and UTF-8 encoding, with UTF-16
and UTF-32 as helpful curiosities.

Something as simple as iconv ("\n", to = "UTF-16") causes an error, due to
an embedded nul.
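
The error arises because iconv returns an R character string by default, and R strings cannot hold nul bytes. A minimal sketch, assuming only the documented toRaw argument of base R's iconv, asks for the encoded bytes as a raw vector instead:

```r
# Ask iconv for raw bytes instead of a character string, so the nul
# bytes that UTF-16 uses for ASCII characters are allowed.
lf_le <- iconv("\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
lf_be <- iconv("\n", from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]

print(lf_le)  # line feed byte, then a nul byte
print(lf_be)  # nul byte, then the line feed byte
```

With toRaw = TRUE the result is a list holding one raw vector per input string, hence the [[1]].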

Then there is write.csv (or write.table), whose fileEncoding parameter
does not work correctly for UTF-16 and UTF-32.
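
One possible way to sidestep fileEncoding is to encode the text yourself and write the bytes in binary mode. The following is only a sketch under stated assumptions: write_csv_utf16le is a hypothetical helper (not part of base R), it writes no byte order mark, and it assumes no field contains an embedded newline.

```r
# Hypothetical helper, not part of base R: write a data frame as a
# UTF-16LE CSV by encoding the text ourselves and writing raw bytes.
write_csv_utf16le <- function(df, path, eol = "\r\n") {
  # Render the CSV to ordinary R strings, one element per output line.
  lines <- utils::capture.output(utils::write.csv(df, row.names = FALSE))
  # Join with explicit line endings before encoding, so that CR and LF
  # are each converted to a full two-byte code unit.
  text <- paste0(paste(lines, collapse = eol), eol)
  bytes <- iconv(text, from = "", to = "UTF-16LE", toRaw = TRUE)[[1]]
  # writeBin opens the file in binary mode, so no bytes are translated.
  writeBin(bytes, path)
  invisible(bytes)
}

demo <- data.frame(a = "x")
out  <- tempfile(fileext = ".csv")
write_csv_utf16le(demo, out)
print(readBin(out, "raw", 1000))
```

Each line of the resulting file should end with the four bytes 0d 00 0a 00, which is the "want" pattern for little-endian output.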

Of course, developers are aware of this; for example:


[Rd] iconv to UTF-16 encoding produces error due to embedded nulls
(write.table with fileEncoding param)
https://stat.ethz.ch/pipermail/r-devel/2016-February/072323.html

iconv to UTF-16 encoding produces error due to embedded nulls (write.table
with fileEncoding param)
http://r.789695.n4.nabble.com/iconv-to-UTF-16-encoding-produces-error-due-to-embedded-nulls-write-table-with-fileEncoding-param-td4717481.html

----------------------------------------------------------------------------

Focussing on write.csv and UTF-16LE and UTF-16BE, it seems that a nul
character is omitted in each <CarriageReturn><LineFeed> pair.
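
A quick way to check a file for this pattern is to scan its bytes for a CR byte directly followed by an LF byte. This is a heuristic sketch only: find_bare_crlf is a hypothetical helper, and it assumes mostly ASCII content, since a legitimate non-ASCII character such as U+0A0D also encodes to that byte pair in UTF-16LE. One possible reading of the output, which is an assumption and not confirmed here, is that the LF byte is expanded to CR LF by text-mode output after the text has already been encoded.

```r
# Hypothetical helper: positions where a CR byte (0d) is immediately
# followed by an LF byte (0a), which well-formed UTF-16 CRLF never has.
find_bare_crlf <- function(bytes) {
  n <- length(bytes)
  if (n < 2) return(integer(0))
  which(bytes[-n] == as.raw(0x0d) & bytes[-1] == as.raw(0x0a))
}

# Byte patterns for a closing quote plus line ending in UTF-16LE:
# "got" as reported for write.csv, and "want".
got  <- as.raw(c(0x22, 0x00, 0x0d, 0x0a, 0x00))
want <- as.raw(c(0x22, 0x00, 0x0d, 0x00, 0x0a, 0x00))

print(find_bare_crlf(got))   # one hit, at the CR byte
print(find_bare_crlf(want))  # no hits
```

For a real file, the same check would be find_bare_crlf(readBin("R_LE.csv", "raw", 1000)).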

TEST SCRIPT
----------------------------------------------------------------------------
remove (list = objects())

print (sessionInfo())
cat ("---------------------------------\n\n")

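# Expected ("want") versus observed ("got") bytes for the CR and LF
# characters of each line ending, little-endian first.
LE <- data.frame (
  want = c ("0d,00", "0a,00"),
  got  = c ("0d   ", "0a,00")
)

# The same comparison for big-endian output.
BE <- data.frame (
  want = c ("00,0d", "00,0a"),
  got  = c ("00,0d", "   0a")
)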

write.csv (LE, "R_LE.csv", fileEncoding = "UTF-16LE", row.names = FALSE)
write.csv (BE, "R_BE.csv", fileEncoding = "UTF-16BE", row.names = FALSE)

print (readBin ("R_LE.csv", "raw", 1000))
print (LE)
cat ("\n")

print (readBin ("R_BE.csv", "raw", 1000))
print (BE)
cat ("\n")

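# Control: UTF-8 output contains no nul byte, so this call succeeds.
try (iconv ("\n", to = "UTF-8"))

# Each UTF-16 and UTF-32 call below fails with "embedded nul in string".
try (iconv ("\n", to = "UTF-16LE"))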
try (iconv ("\n", to = "UTF-16BE"))
try (iconv ("\n", to = "UTF-16"))

try (iconv ("\n", to = "UTF-32LE"))
try (iconv ("\n", to = "UTF-32BE"))
try (iconv ("\n", to = "UTF-32"))
----------------------------------------------------------------------------

TEST SCRIPT OUTPUT

> source ("bug_encoding.R")
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.0
---------------------------------

 [1] 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00 0d
[26] 0a 00 22 00 30 00 64 00 2c 00 30 00 30 00 22 00 2c 00 22 00 30 00 64 00 20
[51] 00 20 00 20 00 22 00 0d 0a 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 2c
[76] 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 0d 0a 00
   want   got
1 0d,00 0d
2 0a,00 0a,00

 [1] 00 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00
[26] 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 64 00 22 00 2c 00 22 00 30 00 30 00
[51] 2c 00 30 00 64 00 22 00 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 61 00 22 00
[76] 2c 00 22 00 20 00 20 00 20 00 30 00 61 00 22 00 0d 0a
   want   got
1 00,0d 00,0d
2 00,0a    0a

Error in iconv("\n", to = "UTF-16LE") : embedded nul in string: '\n\0'
Error in iconv("\n", to = "UTF-16BE") : embedded nul in string: '\0\n'
Error in iconv("\n", to = "UTF-16") : embedded nul in string: 'þÿ\0\n'
Error in iconv("\n", to = "UTF-32LE") :
  embedded nul in string: '\n\0\0\0'
Error in iconv("\n", to = "UTF-32BE") :
  embedded nul in string: '\0\0\0\n'
Error in iconv("\n", to = "UTF-32") :
  embedded nul in string: '\0\0þÿ\0\0\0\n'
>
----------------------------------------------------------------------------
Cheers -- Jack Kelley


