[R-SIG-Mac] Bug in reading UTF-16LE file?

Tue Oct 1 22:50:25 CEST 2024

On 10/1/24 15:31, Jeff Newmiller wrote:
>> This is a problem in macOS libiconv. When converting from "UTF-16" with a BOM, it correctly learns the byte-order from the BOM, but later forgets it in some cases.  This is not a problem in R, but could be worked-around in R.
> So, buggy system code on one system...
>
>> As Simon wrote, to avoid running into these problems (in released versions of R), one should use "UTF-16LE", so explicitly specify the byte-order in the encoding name.
> ... leads to institutionalized non-compliance.
>
>> This is useful also because it is not clear what should be the default when no BOM is present and different systems have different defaults.
> This is nonsense, for reasons previously provided. You are calling a bug a feature. The BOM is supposed to prevent you from having to know this detail, and what you do when no BOM is present should have no bearing on this case.

I will try to explain this differently. The handling of BOMs in existing 
iconv implementations is unreliable (one issue is documented in the R 
documentation, the other is the one we have run into now). Because it is 
unreliable, people who want to be defensive and avoid problems are 
advised to use *LE (or *BE) specifications. The default byte order when 
no BOM is present is not reliable, either (defaults differ between 
systems and the standard is open to interpretation - e.g. my Linux and 
Windows builds of R default to little-endian, while my macOS build 
defaults to big-endian). It is thus not advisable to depend on the 
default order, either, and a defensive solution is again to use *LE or 
*BE specifications. So, in principle, simply always use *LE or *BE.
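
As an illustration of why the byte order matters, here is a minimal 
sketch in C. The helper names utf16_unit_le and utf16_unit_be are 
hypothetical (they are not part of R or iconv); they only show how the 
same two bytes become different UTF-16 code units under each assumption.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of R or iconv. */

/* Combine two bytes into one UTF-16 code unit, assuming little-endian. */
uint16_t utf16_unit_le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Combine two bytes into one UTF-16 code unit, assuming big-endian. */
uint16_t utf16_unit_be(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

For the bytes 74 00 that follow the BOM in the file discussed in this 
thread, the little-endian reading gives 0x0074 (the letter "t"), while 
the big-endian reading gives 0x7400, a code point in the CJK Unified 
Ideographs block.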

This advice is not a feature, it is a work-around that works for two 
problems: that the byte order for specifications like "UTF-16" is 
unknown (bug in the standard) and that specifying the byte-order by a 
BOM is unreliable (bugs in implementations of iconv).

> If Apple is intransigent (which would not be out of character) you could avoid institutionalized non-compliance at the user level by recognizing the buggy system and replacing the generic specification with this inappropriate LE or BE specification as directed by the BOM in the Mac-specific R code.

Yes, indeed, the work-around for the libiconv bug can be implemented in 
future versions of R, and an experimental version is already in R-devel 
(still subject to change), so that at the user level, specifying, say, 
"UTF-16" on an input with a BOM will correctly use the byte order given 
by the BOM.
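
The general idea of such a work-around (inspect the BOM, then hand iconv 
an explicit byte order) could be sketched roughly as follows. This is 
only an illustration under assumptions: the function name 
utf16_name_from_bom is hypothetical, and this is not the code that is in 
R-devel.

```c
#include <stddef.h>

/* Hypothetical helper for illustration; this is NOT the code in R-devel.
 * Given the first bytes of an input declared as "UTF-16", return the
 * explicit encoding name implied by the BOM, or NULL if no BOM is
 * present (the caller then has to decide on a default itself). */
const char *utf16_name_from_bom(const unsigned char *buf, size_t n)
{
    if (n >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return "UTF-16LE";  /* U+FEFF stored in little-endian order */
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return "UTF-16BE";  /* U+FEFF stored in big-endian order */
    }
    return NULL;
}
```

A caller using this approach would presumably also skip the two BOM 
bytes before converting with the explicit name, as the connections.c 
code quoted below does for the little-endian encodings.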

I don't find anything inappropriate about the *LE/*BE specifications.

Best
Tomas

>
>
> On October 1, 2024 4:34:41 AM MST, Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>> On 9/9/24 12:53, Tomas Kalibera wrote:
>>> On 9/9/24 10:53, peter dalgaard wrote:
>>>> I am confused, and maybe I should just butt out of this, but:
>>>>
>>>> (a) BOM are designed to, um, mark the byte order...
>>>>
>>>> (b) in connections.c we have
>>>>
>>>>               if(checkBOM && con->inavail >= 2 &&
>>>>                  ((int)con->iconvbuff[0] & 0xff) == 255 &&
>>>>                  ((int)con->iconvbuff[1] & 0xff) == 254) {
>>>>                   con->inavail -= (short) 2;
>>>>                   memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
>>>>               }
>>>>    which checks for the two first bytes being FF, FE. However, a big-endian BOM would be FE, FF and I see no check for that.
>>> I think this is correct, it is executed only for encodings declared little-endian (UTF-16LE, UCS2-LE) - so, iconv will still know what is the byte-order from the name of the encoding, it will just not see the same information in the BOM.
>>>> Duncan's file starts
>>>>
>>>>> readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt', what="raw", n=10)
>>>>    [1] ff fe 74 00 69 00 6d 00 65 00
>>>>
>>>> so the BOM does indeed indicate little-endian, but apparently we proceed to discard it and read the file with system (big-)endianness, which strikes me as just plain wrong...
>>> I've tested we are not discarding it by the code above and that iconv gets to see the BOM bytes.
>>>> I see no Mac-specific code for this, only win_iconv.c, so presumably we have potential issues on everything non-Windows?
>>> I can reproduce the problem and will have a closer look, it may still be there is a bug in R. We have some work-arounds for recent iconv issues on macOS in sysutils.c.
>> This is a problem in macOS libiconv. When converting from "UTF-16" with a BOM, it correctly learns the byte-order from the BOM, but later forgets it in some cases.  This is not a problem in R, but could be worked-around in R.
>>
>> As Simon wrote, to avoid running into these problems (in released versions of R), one should use "UTF-16LE", so explicitly specify the byte-order in the encoding name. This is useful also because it is not clear what should be the default when no BOM is present and different systems have different defaults.
>>
>> Best
>> Tomas
>>
>>> Tomas
>>>
>>>> -pd
>>>>
>>>>> On 9 Sep 2024, at 01:11 , Simon Urbanek <simon.urbanek using r-project.org> wrote:
>>>>>
>>>>>   From the help page:
>>>>>
>>>>>       The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>>>>>       as they are appropriate values for Windows ‘Unicode’ text files.
>>>>>       If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>>>>>       are removed as some implementations of ‘iconv’ do not accept BOMs.
>>>>>
>>>>> so "UTF-16LE" is the documented way to reliably read such files.
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>>
>>>>>
>>>>>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>>>>>
>>>>>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>>>>>
>>>>>> On R-help there's a thread about reading a remote file that is coded in UTF-16LE with a byte-order mark.  Jeff Newmiller pointed out (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it would be better to declare the encoding as "UTF-16", because the BOM will indicate little endian.
>>>>>>
>>>>>> I tried this on my Mac running R 4.4.1, and it didn't work. I get the same incorrect result from all of these commands:
>>>>>>
>>>>>> # Automatically recognizing a URL and using fileEncoding:
>>>>>> read.delim(
>>>>>> 'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>      fileEncoding = "UTF-16"
>>>>>> )
>>>>>>
>>>>>> # Using explicit url() with encoding:
>>>>>> read.delim(
>>>>>> url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>         encoding = "UTF-16")
>>>>>> )
>>>>>>
>>>>>> # Specifying the endianness incorrectly:
>>>>>> read.delim(
>>>>>> url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>         encoding = "UTF-16BE")
>>>>>> )
>>>>>>
>>>>>> The only way I get the correct result is if I specify "UTF-16LE" explicitly, whereas Jeff got correct results on several different systems using "UTF-16".
>>>>>>
>>>>>> Is this a MacOS bug or an R for MacOS bug?
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>
>>>>>> _______________________________________________
>>>>>> R-SIG-Mac mailing list
>>>>>> R-SIG-Mac using r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>>>>