[R-SIG-Mac] Bug in reading UTF-16LE file?

Mon Sep 9 16:54:13 CEST 2024

Definitely not about R... but to the question:

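C (well, really every computer language) defines integer operations on values, so integers logically behave as big-endian: a shift such as x << 8 always moves bits toward the most significant end, regardless of whether the underlying bytes are stored BE or LE. Converting a byte stream (bytes) to wide character data (ints or uints) therefore only needs to combine the bytes with bit shifts, taking them in swapped order in the LE case.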
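
To make that concrete, here is a minimal sketch of shift-based decoding. The function names (u16_from_le, u16_from_be, decode_u16) are made up for illustration and are not taken from R's sources; the sketch only handles raw 16-bit code units and does nothing about BOMs or surrogate pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine two bytes into one UTF-16 code unit using shifts only.
   Shifts act on values, not on the in-memory layout, so the result
   is the same on little-endian and big-endian hosts. */
uint16_t u16_from_le(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));   /* low byte comes first */
}

uint16_t u16_from_be(const unsigned char *p) {
    return (uint16_t)((p[0] << 8) | p[1]);   /* high byte comes first */
}

/* Decode n_bytes of UTF-16 data (hypothetical helper, no BOM handling).
   Writes n_bytes / 2 code units to out and returns that count;
   a trailing odd byte is ignored. */
size_t decode_u16(const unsigned char *buf, size_t n_bytes,
                  int little_endian, uint16_t *out) {
    size_t n = n_bytes / 2;
    for (size_t i = 0; i < n; i++) {
        const unsigned char *p = buf + 2 * i;
        out[i] = little_endian ? u16_from_le(p) : u16_from_be(p);
    }
    return n;
}
```

Fed the bytes 74 00 69 00 from the file quoted below with little_endian set, this yields the code units 0x0074 and 0x0069 ("t", "i") on any host.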

You cannot rely on "same as my architecture" pointer re-interpretation of multi-byte values: most of the time the word size won't match, and even if it does, the word-boundary alignment will often be off, so the pointer dereference can fail on platforms that require aligned access.
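
The same byte-wise approach works for the BOM check discussed in the quoted messages below. This is only a sketch under my own assumptions, not the logic in connections.c: the enum and the function name sniff_utf16_bom are hypothetical, and it deliberately ignores the fact that FF FE 00 00 could also be a UTF-32LE BOM.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE = 0, BOM_UTF16LE, BOM_UTF16BE };

/* Inspect the first two bytes one at a time. No pointer casts are
   involved, so neither alignment nor host byte order matters. */
enum bom_kind sniff_utf16_bom(const unsigned char *buf, size_t len) {
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) return BOM_UTF16LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF) return BOM_UTF16BE;
    }
    return BOM_NONE;  /* no BOM seen: the caller has to pick a default */
}
```

A caller would skip the two BOM bytes when the result is not BOM_NONE and then assemble the remaining bytes in the indicated order.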

On September 9, 2024 1:53:45 AM PDT, peter dalgaard <pdalgd using gmail.com> wrote:
>I am confused, and maybe I should just butt out of this, but:
>
>(a) BOM are designed to, um, mark the byte order...
>
>(b) in connections.c we have 
>
>            if(checkBOM && con->inavail >= 2 &&
>               ((int)con->iconvbuff[0] & 0xff) == 255 &&
>               ((int)con->iconvbuff[1] & 0xff) == 254) {
>                con->inavail -= (short) 2;
>                memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
>            }
> 
>which checks for the first two bytes being FF, FE. However, a big-endian BOM would be FE, FF and I see no check for that.
>
>Duncan's file starts
>
>> readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt', what="raw", n=10)
> [1] ff fe 74 00 69 00 6d 00 65 00
>
>so the BOM does indeed indicate little-endian, but apparently we proceed to discard it and read the file with system (big-)endianness, which strikes me as just plain wrong...
>
>I see no Mac-specific code for this, only win_iconv.c, so presumably we have potential issues on everything non-Windows?
>
>-pd
>
>> On 9 Sep 2024, at 01:11 , Simon Urbanek <simon.urbanek using r-project.org> wrote:
>> 
>> From the help page:
>> 
>>     The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>>     as they are appropriate values for Windows ‘Unicode’ text files.
>>     If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>>     are removed as some implementations of ‘iconv’ do not accept BOMs.
>> 
>> so "UTF-16LE" is the documented way to reliably read such files.
>> 
>> Cheers,
>> Simon
>> 
>> 
>> 
>>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>> 
>>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>> 
>>> On R-help there's a thread about reading a remote file that is coded in UTF-16LE with a byte-order mark.  Jeff Newmiller pointed out (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it would be better to declare the encoding as "UTF-16", because the BOM will indicate little endian.
>>> 
>>> I tried this on my Mac running R 4.4.1, and it didn't work.  I get the same incorrect result from all of these commands:
>>> 
>>> # Automatically recognizing a URL and using fileEncoding:
>>> read.delim(
>>> 'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>    fileEncoding = "UTF-16"
>>> )
>>> 
>>> # Using explicit url() with encoding:
>>> read.delim(
>>> url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>       encoding = "UTF-16")
>>> )
>>> 
>>> # Specifying the endianness incorrectly:
>>> read.delim(
>>> url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>       encoding = "UTF-16BE")
>>> )
>>> 
>>> The only way I get the correct result is if I specify "UTF-16LE" explicitly, whereas Jeff got correct results on several different systems using "UTF-16".
>>> 
>>> Is this a macOS bug or an R for macOS bug?
>>> 
>>> Duncan Murdoch
>>> 
>>> _______________________________________________
>>> R-SIG-Mac mailing list
>>> R-SIG-Mac using r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>> 
>> 
>
