[R-SIG-Mac] Bug in reading UTF-16LE file?
Tomas Kalibera
tomas.kalibera at gmail.com
Mon Sep 9 12:53:25 CEST 2024
On 9/9/24 10:53, peter dalgaard wrote:
> I am confused, and maybe I should just butt out of this, but:
>
> (a) BOM are designed to, um, mark the byte order...
>
> (b) in connections.c we have
>
>     if(checkBOM && con->inavail >= 2 &&
>        ((int)con->iconvbuff[0] & 0xff) == 255 &&
>        ((int)con->iconvbuff[1] & 0xff) == 254) {
>         con->inavail -= (short) 2;
>         memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
>     }
>
> which checks for the two first bytes being FF, FE. However, a big-endian BOM would be FE, FF and I see no check for that.
I think this is correct: that code is executed only for encodings
declared little-endian (UTF-16LE, UCS-2LE). So iconv still knows the
byte order from the name of the encoding; it just does not see the
same information repeated in the BOM.
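
To make the condition concrete, here is a minimal stand-alone sketch of
that logic. It is not the code from connections.c: the helper names
declared_little_endian() and strip_le_bom() are hypothetical, and it
assumes the check is enabled only when the declared encoding name is
exactly "UTF-16LE" or "UCS-2LE".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an R API): returns 1 if the declared encoding
   name is one of the explicitly little-endian names the check applies to. */
static int declared_little_endian(const char *enc)
{
    return strcmp(enc, "UTF-16LE") == 0 || strcmp(enc, "UCS-2LE") == 0;
}

/* Hypothetical helper: drop a leading FF FE from buf (length *len), but
   only when the encoding was declared little-endian.
   Returns 1 if a BOM was removed, 0 otherwise. */
static int strip_le_bom(unsigned char *buf, size_t *len, const char *enc)
{
    assert(buf != NULL && len != NULL && enc != NULL);
    if (declared_little_endian(enc) && *len >= 2 &&
        buf[0] == 0xFF && buf[1] == 0xFE) {
        *len -= 2;
        memmove(buf, buf + 2, *len);  /* shift the payload over the BOM */
        return 1;
    }
    return 0;
}
```

With the encoding declared as plain "UTF-16", this sketch leaves the
buffer untouched, so the converter would still receive the BOM.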
>
> Duncan's file starts
>
>> readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt', what="raw", n=10)
> [1] ff fe 74 00 69 00 6d 00 65 00
>
> so the BOM does indeed indicate little-endian, but apparently we proceed to discard it and read the file with system (big-)endianness, which strikes me as just plain wrong...
I've tested that the code above does not discard it in this case and
that iconv gets to see the BOM bytes.
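
For reference, a converter that is handed the BOM can tell the two
UTF-16 byte orders apart from the first two bytes alone. The following
is only an illustrative sketch, not what iconv or R actually does; the
type bom_kind and the function classify_utf16_bom() are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Classify the first two bytes of a buffer as a UTF-16 BOM.
   FF FE is U+FEFF stored little-endian, FE FF is U+FEFF stored big-endian.
   Note: this sketch deliberately ignores UTF-32, whose little-endian BOM
   (FF FE 00 00) also starts with FF FE. */
static bom_kind classify_utf16_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 2)
        return BOM_NONE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

Applied to the ten bytes shown above (ff fe 74 00 69 00 6d 00 65 00),
this classifies the file as little-endian.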
>
> I see no Mac-specific code for this, only win_iconv.c, so presumably we have potential issues on everything non-Windows?
I can reproduce the problem and will have a closer look; there may
still be a bug in R. We have some work-arounds for recent iconv issues
on macOS in sysutils.c.
Tomas
>
> -pd
>
>> On 9 Sep 2024, at 01:11 , Simon Urbanek <simon.urbanek using r-project.org> wrote:
>>
>> From the help page:
>>
>> The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>> as they are appropriate values for Windows ‘Unicode’ text files.
>> If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>> are removed as some implementations of ‘iconv’ do not accept BOMs.
>>
>> so "UTF-16LE" is the documented way to reliably read such files.
>>
>> Cheers,
>> Simon
>>
>>
>>
>>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>>
>>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>>
>>> On R-help there's a thread about reading a remote file that is coded in UTF-16LE with a byte-order mark. Jeff Newmiller pointed out (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it would be better to declare the encoding as "UTF-16", because the BOM will indicate little endian.
>>>
>>> I tried this on my Mac running R 4.4.1, and it didn't work. I get the same incorrect result from all of these commands:
>>>
>>> # Automatically recognizing a URL and using fileEncoding:
>>> read.delim(
>>>   'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>   fileEncoding = "UTF-16"
>>> )
>>>
>>> # Using explicit url() with encoding:
>>> read.delim(
>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>       encoding = "UTF-16")
>>> )
>>>
>>> # Specifying the endianness incorrectly:
>>> read.delim(
>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>       encoding = "UTF-16BE")
>>> )
>>>
>>> The only way I get the correct result is if I specify "UTF-16LE" explicitly, whereas Jeff got correct results on several different systems using "UTF-16".
>>>
>>> Is this a MacOS bug or an R for MacOS bug?
>>>
>>> Duncan Murdoch
>>>
>>> _______________________________________________
>>> R-SIG-Mac mailing list
>>> R-SIG-Mac using r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>