[R-SIG-Mac] Bug in reading UTF-16LE file?

Jeff Newmiller jdnewm|| @end|ng |rom dcn@d@v|@@c@@u@
Mon Sep 9 00:41:00 CEST 2024


I don't know whether macOS uses libiconv, but I was looking at libiconv-1.17/lib/utf16.h, and utf16_mbtowc assumes that its first argument has an istate element pre-initialized to the architecture's endianness. I don't have time to keep digging into this right now (and I have no ARM Mac to debug on), but if that element were somehow always set to LE in this context (by R?), I think that would explain this behavior.
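One way to separate the iconv layer from R's connection code would be to hand iconv() a few raw bytes directly. This is only a sketch: the six-byte sample and the helper name decode_as are made up for illustration, and no output is shown because it is expected to differ between platforms. If the BOM is honored, the "UTF-16" result should be the two bytes 41 0a.

```r
# Which iconv implementation was this R build linked against?
print(extSoftVersion()[["iconv"]])

# "A" plus a newline in UTF-16LE, preceded by the little-endian BOM (FF FE).
bytes <- as.raw(c(0xFF, 0xFE, 0x41, 0x00, 0x0A, 0x00))

# Hypothetical helper: convert the sample to UTF-8 under a given label.
# toRaw = TRUE returns the converted bytes (NULL if the conversion fails),
# which avoids any confusion from how the console prints odd characters.
decode_as <- function(from) {
  iconv(list(bytes), from = from, to = "UTF-8", toRaw = TRUE)[[1]]
}

labels <- c("UTF-16", "UTF-16LE", "UTF-16BE")
res <- lapply(setNames(labels, labels), decode_as)
print(res)
```

Comparing the three entries on an affected Mac and on a system where "UTF-16" works should show whether the difference is already present at the iconv level.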

I know most people will just bail on UTF-16 and use the UTF-16LE hack (hacky because when a BOM is present you aren't supposed to specify LE) to get on with life, but this seems to me like an unfortunate failure to follow the standard, one that ought to have been noticed by now. [1]

[1] https://unicode.org/faq/utf_bom.html#bom10, item (4): don't mix an LE/BE specification with data that has a BOM.
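For anyone who wants to rerun the comparison from the quoted message without depending on the remote file, here is a rough sketch that writes a tiny UTF-16LE file with a BOM to a temporary path and reads it back under each label. The sample text and the helper name read_as are assumptions for illustration, not part of the original report, and the results are left unshown since they are the thing in question.

```r
# Two-column, tab-separated sample; plain ASCII keeps the encoding step trivial.
txt <- "x\ty\n1\t2\n"

# UTF-16LE by hand: BOM (FF FE), then each ASCII byte followed by 0x00.
ascii <- as.integer(charToRaw(txt))
bytes <- c(as.raw(c(0xFF, 0xFE)), as.raw(c(rbind(ascii, 0L))))

tmp <- tempfile(fileext = ".txt")
writeBin(bytes, tmp)

# Hypothetical helper: read the file under one label, returning the
# condition message instead of stopping if R warns or fails.
read_as <- function(enc) {
  tryCatch(
    read.delim(tmp, fileEncoding = enc),
    warning = function(w) paste("warning:", conditionMessage(w)),
    error = function(e) paste("error:", conditionMessage(e))
  )
}

labels <- c("UTF-16", "UTF-16LE", "UTF-16BE")
results <- lapply(setNames(labels, labels), read_as)
print(results)
unlink(tmp)
```

If "UTF-16" and "UTF-16BE" give the same wrong result here while "UTF-16LE" gives a two-column data frame, that would match what Duncan describes below.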

On September 8, 2024 2:23:36 AM PDT, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>To R-SIG-Mac, with a copy to Jeff Newmiller:
>
>On R-help there's a thread about reading a remote file that is coded in UTF-16LE with a byte-order mark.  Jeff Newmiller pointed out (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it would be better to declare the encoding as "UTF-16", because the BOM will indicate little endian.
>
>I tried this on my Mac running R 4.4.1, and it didn't work.  I get the same incorrect result from all of these commands:
>
> # Automatically recognizing a URL and using fileEncoding:
> read.delim(
>     'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>     fileEncoding = "UTF-16"
> )
>
> # Using explicit url() with encoding:
> read.delim(
>     url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>         encoding = "UTF-16")
> )
>
> # Specifying the endianness incorrectly:
> read.delim(
>     url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>         encoding = "UTF-16BE")
> )
>
>The only way I get the correct result is if I specify "UTF-16LE" explicitly, whereas Jeff got correct results on several different systems using "UTF-16".
>
>Is this a MacOS bug or an R for MacOS bug?
>
>Duncan Murdoch

-- 
Sent from my phone. Please excuse my brevity.


