[R-SIG-Mac] Bug in reading UTF-16LE file?

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Sep 9 11:30:12 CEST 2024


On 08/09/2024 23:41, Jeff Newmiller via R-SIG-Mac wrote:
> I don't know whether MacOSX uses libiconv,

It no longer does, although it reports compatibility with GNU libiconv 
1.13.  It is not at all compatible, which has caused a lot of extra work, 
not least as the incompatibilities have changed and increased at point 
releases of macOS 14.  OTOH, the minimum macOS version required by R's 
binary builds does use GNU libiconv, probably 1.11 (which is old, 2006).  
So testing iconv on macOS is a lottery.
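
One way to see which behaviour a given build exhibits is to feed iconv() a 
few known bytes.  The following is only a minimal sketch using base R; the 
helper name utf16_bom_ok is made up for illustration, and a FALSE result 
says nothing about which iconv implementation is in use.

```r
# Hypothetical helper: does this platform's iconv honour a little-endian
# BOM when the encoding is declared simply as "UTF-16"?
utf16_bom_ok <- function() {
  # FF FE is the UTF-16 little-endian BOM, followed by "A" and "B"
  bytes <- as.raw(c(0xff, 0xfe, 0x41, 0x00, 0x42, 0x00))
  out <- tryCatch(
    iconv(list(bytes), from = "UTF-16", to = "UTF-8", toRaw = TRUE)[[1]],
    error = function(e) NULL   # conversion not supported at all
  )
  # NULL means the conversion failed outright
  if (is.null(out)) return(FALSE)
  # Some implementations may keep the BOM as U+FEFF (EF BB BF in UTF-8)
  if (length(out) >= 3 && identical(out[1:3], as.raw(c(0xef, 0xbb, 0xbf))))
    out <- out[-(1:3)]
  identical(out, charToRaw("AB"))
}

utf16_bom_ok()
```

TRUE would suggest the BOM was used to pick the byte order; FALSE would be 
consistent with the misreading reported in the quoted message.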

Note that neither Linux nor Windows uses GNU libiconv, and AFAIR neither 
does recent FreeBSD.  Last year when I worked on iconv I did not find a 
platform currently using GNU libiconv and had to use a temporary 
installation from the sources.

> but I was looking at libiconv-1.17/lib/utf16.h and utf16_mbtowc 
> assumes the first argument has an istate element that is pre-initialized 
> to the architecture endianness. I don't have time to keep digging into 
> this right now (and no ARM mac to debug on), but if that was somehow 
> always set to LE in this context (by R?) then I think that would explain 
> this behavior.
> 
> I know, most people will just bail on UTF16 and use the UTF16LE hack (hacky because when the BOM is there you aren't supposed to use LE) to get on with life, but this seems to me like an unfortunate failure to follow the standard that ought to have been noticed by now. [1]
> 
> [1] https://unicode.org/faq/utf_bom.html#bom10 item (4)... don't mix LE/BE specification with data that has a BOM.
> 
> On September 8, 2024 2:23:36 AM PDT, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>
>> On R-help there's a thread about reading a remote file that is coded in UTF-16LE with a byte-order mark.  Jeff Newmiller pointed out (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it would be better to declare the encoding as "UTF-16", because the BOM will indicate little endian.
>>
>> I tried this on my Mac running R 4.4.1, and it didn't work.  I get the same incorrect result from all of these commands:
>>
>> # Automatically recognizing a URL and using fileEncoding:
>> read.delim(
>>   'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>   fileEncoding = "UTF-16"
>> )
>>
>> # Using explicit url() with encoding:
>> read.delim(
>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>       encoding = "UTF-16")
>> )
>>
>> # Specifying the endianness incorrectly:
>> read.delim(
>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>       encoding = "UTF-16BE")
>> )
>>
>> The only way I get the correct result is if I specify "UTF-16LE" explicitly, whereas Jeff got correct results on several different systems using "UTF-16".
>>
>> Is this a MacOS bug or an R for MacOS bug?
>>
>> Duncan Murdoch
> 
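
For anyone who needs to get on with reading such a file in the meantime, 
the explicit-endianness workaround from the quoted thread can be made less 
arbitrary by looking at the BOM first.  This is only a sketch: the helper 
name sniff_utf16 and the local path 'employee.txt' are assumptions for 
illustration, and it does not distinguish a UTF-32LE BOM, which also starts 
with FF FE.

```r
# Hypothetical helper: inspect the first two bytes of a local file and
# return an explicit encoding name, or NA if no UTF-16 BOM is present.
sniff_utf16 <- function(path) {
  bom <- readBin(path, what = "raw", n = 2L)
  if (identical(bom, as.raw(c(0xff, 0xfe)))) {
    "UTF-16LE"
  } else if (identical(bom, as.raw(c(0xfe, 0xff)))) {
    "UTF-16BE"
  } else {
    NA_character_
  }
}

# Usage on a downloaded copy ('employee.txt' is a placeholder path):
# enc <- sniff_utf16("employee.txt")
# if (!is.na(enc)) read.delim("employee.txt", fileEncoding = enc)
```

This only sidesteps the question of what "UTF-16" should do; it still 
passes an LE/BE label for data that carries a BOM.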


-- 
Brian D. Ripley,                  ripley using stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford


