[R-SIG-Mac] Bug in reading UTF-16LE file?

peter dalgaard pd@|gd @end|ng |rom gm@||@com
Mon Sep 9 10:53:45 CEST 2024


I am confused, and maybe I should just butt out of this, but:

(a) BOMs are designed to, um, mark the byte order...

(b) in connections.c we have 

            if(checkBOM && con->inavail >= 2 &&
               ((int)con->iconvbuff[0] & 0xff) == 255 &&
               ((int)con->iconvbuff[1] & 0xff) == 254) {
                con->inavail -= (short) 2;
                memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
            }
 
which checks whether the first two bytes are FF, FE. However, a big-endian BOM would be FE, FF, and I see no check for that.
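
To make the distinction concrete, here is a minimal sketch of a check that recognises both byte orders instead of only FF, FE. The names bom_kind and detect_utf16_bom are hypothetical and are not part of R's connections.c; this only illustrates the idea and is not a proposed patch.

```c
#include <stddef.h>

/* Hypothetical helper, NOT taken from R's connections.c. */
enum bom_kind { BOM_NONE = 0, BOM_UTF16LE, BOM_UTF16BE };

/* Inspect the first two bytes of a buffer and report which UTF-16
   byte order mark, if any, they form. */
static enum bom_kind detect_utf16_bom(const unsigned char *buf, size_t n)
{
    if (n < 2)
        return BOM_NONE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16LE;   /* U+FEFF stored low byte first */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16BE;   /* U+FEFF stored high byte first */
    return BOM_NONE;
}
```

A caller could use the result to choose between "UTF-16LE" and "UTF-16BE" before stripping the mark, rather than dropping it and leaving the byte order to the converter's default.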

Duncan's file starts

> readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt', what="raw", n=10)
 [1] ff fe 74 00 69 00 6d 00 65 00

so the BOM does indeed indicate little-endian, but apparently we proceed to discard it and read the file with system (big-)endianness, which strikes me as just plain wrong...
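
As an illustration of why the byte order matters once the BOM has been dropped, the sketch below combines two bytes into one UTF-16 code unit under either assumption. The function name utf16_unit is hypothetical, chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical helper: combine two consecutive bytes into one UTF-16
   code unit, interpreting them as little- or big-endian. */
static uint16_t utf16_unit(const unsigned char *p, int little_endian)
{
    if (little_endian)
        return (uint16_t)(p[0] | (p[1] << 8));   /* low byte first */
    return (uint16_t)((p[0] << 8) | p[1]);       /* high byte first */
}
```

For the first pair after the BOM in the dump above, 74 00, the little-endian reading gives U+0074 ("t"), while the big-endian reading gives U+7400, a CJK ideograph.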

I see no Mac-specific code for this, only win_iconv.c, so presumably we have potential issues on everything non-Windows?

-pd

> On 9 Sep 2024, at 01:11 , Simon Urbanek <simon.urbanek using r-project.org> wrote:
> 
> From the help page:
> 
>     The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>     as they are appropriate values for Windows ‘Unicode’ text files.
>     If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>     are removed as some implementations of ‘iconv’ do not accept BOMs.
> 
> so "UTF-16LE" is the documented way to reliably read such files.
> 
> Cheers,
> Simon
> 
> 
> 
>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>> 
>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>> 
>> On R-help there's a thread about reading a remote file that is coded in UTF-16LE with a byte-order mark.  Jeff Newmiller pointed out (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it would be better to declare the encoding as "UTF-16", because the BOM will indicate little endian.
>> 
>> I tried this on my Mac running R 4.4.1, and it didn't work.  I get the same incorrect result from all of these commands:
>> 
>> # Automatically recognizing a URL and using fileEncoding:
>> read.delim(
>> 'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>    fileEncoding = "UTF-16"
>> )
>> 
>> # Using explicit url() with encoding:
>> read.delim(
>> url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>       encoding = "UTF-16")
>> )
>> 
>> # Specifying the endianness incorrectly:
>> read.delim(
>> url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>       encoding = "UTF-16BE")
>> )
>> 
>> The only way I get the correct result is if I specify "UTF-16LE" explicitly, whereas Jeff got correct results on several different systems using "UTF-16".
>> 
>> Is this a macOS bug or an R for macOS bug?
>> 
>> Duncan Murdoch
>> 
>> _______________________________________________
>> R-SIG-Mac mailing list
>> R-SIG-Mac using r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>> 
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com


