[R-SIG-Mac] Bug in reading UTF-16LE file?
peter dalgaard
pdalgd using gmail.com
Mon Sep 9 10:53:45 CEST 2024
I am confused, and maybe I should just butt out of this, but:
(a) BOM are designed to, um, mark the byte order...
(b) in connections.c we have
    if(checkBOM && con->inavail >= 2 &&
       ((int)con->iconvbuff[0] & 0xff) == 255 &&
       ((int)con->iconvbuff[1] & 0xff) == 254) {
        con->inavail -= (short) 2;
        memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
    }
which checks for the two first bytes being FF, FE. However, a big-endian BOM would be FE, FF and I see no check for that.
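For illustration only, a symmetric check could look roughly like the sketch below. It is not R's code: the helper name strip_utf16_bom, the bom_kind enum, and the plain buffer/length arguments are hypothetical stand-ins for the connection fields, and how the detected byte order would then be passed on to iconv is left open.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; nothing like this exists in connections.c. */
typedef enum { BOM_NONE, BOM_UTF16LE, BOM_UTF16BE } bom_kind;

/* Hypothetical helper: detect a UTF-16 BOM at the start of buf,
 * strip it if present, and report which byte order it indicated.
 * *inavail is the number of valid bytes in buf and is updated in place. */
static bom_kind strip_utf16_bom(unsigned char *buf, size_t *inavail)
{
    bom_kind kind = BOM_NONE;

    if (*inavail >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            kind = BOM_UTF16LE;   /* FF FE: little-endian */
        else if (buf[0] == 0xFE && buf[1] == 0xFF)
            kind = BOM_UTF16BE;   /* FE FF: big-endian */
    }
    if (kind != BOM_NONE) {
        /* Drop the two BOM bytes, as the existing code does for FF FE. */
        *inavail -= 2;
        memmove(buf, buf + 2, *inavail);
    }
    return kind;
}
```

The point of returning the kind rather than just stripping is that the caller would still know which byte order the file declared after the BOM is gone.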
Duncan's file starts
> readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt', what="raw", n=10)
[1] ff fe 74 00 69 00 6d 00 65 00
so the BOM does indeed indicate little-endian, but apparently we proceed to discard it and read the file with system (big-)endianness, which strikes me as just plain wrong...
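To make the consequence concrete, here is a small sketch of what happens to the first code unit after the BOM (bytes 74 00 in the dump above) under each byte order. The function name utf16_unit is hypothetical and only illustrates the arithmetic; it is not how iconv is invoked.

```c
#include <stdint.h>

/* Hypothetical helper: combine two bytes into one UTF-16 code unit,
 * interpreting them as big-endian if big_endian is nonzero,
 * otherwise as little-endian. */
static uint16_t utf16_unit(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return (uint16_t)((p[0] << 8) | p[1]);
    return (uint16_t)((p[1] << 8) | p[0]);
}
```

With the bytes 74 00, the little-endian reading gives U+0074 ("t"), while the big-endian reading gives U+7400, an unrelated CJK code point, which would account for garbled output when the wrong byte order is assumed.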
I see no Mac-specific code for this, only win_iconv.c, so presumably we have potential issues on everything non-Windows?
-pd
> On 9 Sep 2024, at 01:11 , Simon Urbanek <simon.urbanek using r-project.org> wrote:
>
> From the help page:
>
> The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
> as they are appropriate values for Windows ‘Unicode’ text files.
> If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
> are removed as some implementations of ‘iconv’ do not accept BOMs.
>
> so "UTF-16LE" is the documented way to reliably read such files.
>
> Cheers,
> Simon
>
>
>
>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>
>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>
>> On R-help there's a thread about reading a remote file that is coded in UTF-16LE with a byte-order mark. Jeff Newmiller pointed out (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it would be better to declare the encoding as "UTF-16", because the BOM will indicate little endian.
>>
>> I tried this on my Mac running R 4.4.1, and it didn't work. I get the same incorrect result from all of these commands:
>>
>> # Automatically recognizing a URL and using fileEncoding:
>> read.delim(
>>   'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>   fileEncoding = "UTF-16"
>> )
>>
>> # Using explicit url() with encoding:
>> read.delim(
>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>       encoding = "UTF-16")
>> )
>>
>> # Specifying the endianness incorrectly:
>> read.delim(
>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>       encoding = "UTF-16BE")
>> )
>>
>> The only way I get the correct result is if I specify "UTF-16LE" explicitly, whereas Jeff got correct results on several different systems using "UTF-16".
>>
>> Is this a MacOS bug or an R for MacOS bug?
>>
>> Duncan Murdoch
>>
>> _______________________________________________
>> R-SIG-Mac mailing list
>> R-SIG-Mac using r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>
>
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk Priv: PDalgd using gmail.com