[R-SIG-Mac] Bug in reading UTF-16LE file?
Tomas Kalibera
tomas.kalibera using gmail.com
Tue Oct 1 13:34:41 CEST 2024
On 9/9/24 12:53, Tomas Kalibera wrote:
>
> On 9/9/24 10:53, peter dalgaard wrote:
>> I am confused, and maybe I should just butt out of this, but:
>>
>> (a) BOM are designed to, um, mark the byte order...
>>
>> (b) in connections.c we have
>>
>>     if(checkBOM && con->inavail >= 2 &&
>>        ((int)con->iconvbuff[0] & 0xff) == 255 &&
>>        ((int)con->iconvbuff[1] & 0xff) == 254) {
>>         con->inavail -= (short) 2;
>>         memmove(con->iconvbuff, con->iconvbuff+2,
>>                 con->inavail);
>>     }
>> which checks for the first two bytes being FF, FE. However, a
>> big-endian BOM would be FE, FF, and I see no check for that.
> I think this is correct: it is executed only for encodings declared
> little-endian (UTF-16LE, UCS-2LE), so iconv still knows the byte order
> from the name of the encoding; it just does not see the same
> information in the BOM.
>>
>> Duncan's file starts
>>
>> > readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>> +         what="raw", n=10)
>> [1] ff fe 74 00 69 00 6d 00 65 00
>>
>> so the BOM does indeed indicate little-endian, but apparently we
>> proceed to discard it and read the file with system (big-)endianness,
>> which strikes me as just plain wrong...
> I've tested that the code above does not discard it and that iconv
> gets to see the BOM bytes.
>>
>> I see no Mac-specific code for this, only win_iconv.c, so presumably
>> we have potential issues on everything non-Windows?
>
> I can reproduce the problem and will have a closer look; there may
> still be a bug in R. We have some work-arounds for recent iconv
> issues on macOS in sysutils.c.
This is a problem in macOS libiconv. When converting from "UTF-16" with
a BOM, it correctly learns the byte order from the BOM, but in some
cases forgets it later. This is not a bug in R, but it could be worked
around in R.
As Simon wrote, to avoid running into these problems (in released
versions of R), one should use "UTF-16LE", that is, specify the byte
order explicitly in the encoding name. This is also useful because it is
not clear what the default should be when no BOM is present, and
different systems have different defaults.
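
When the byte order of a file is not known in advance, the same advice
can be followed by looking at the first two bytes and passing the
matching explicit name to iconv. Below is a minimal sketch in C,
assuming the input is already known to be UTF-16. The helper
utf16_name_from_bom is hypothetical; it is not part of R or libiconv,
and it is not what R currently does.

```c
#include <stddef.h>

/* Hypothetical helper (an assumption for illustration, not R or
   libiconv code): choose an explicit UTF-16 encoding name from the
   first bytes of a buffer that is already known to hold UTF-16 text.
   *bom_len is set to the number of BOM bytes found (0 or 2). */
const char *utf16_name_from_bom(const unsigned char *buf, size_t len,
                                size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) {  /* FF FE: little-endian */
            *bom_len = 2;
            return "UTF-16LE";
        }
        if (buf[0] == 0xFE && buf[1] == 0xFF) {  /* FE FF: big-endian */
            *bom_len = 2;
            return "UTF-16BE";
        }
    }
    /* No BOM: the byte order is left to the converter's default,
       which differs between systems. */
    return "UTF-16";
}
```

For Duncan's file, which starts with ff fe, this returns "UTF-16LE" and
a BOM length of 2. A caller could then skip those two bytes and convert
the rest with the explicit name, so the converter never has to remember
the byte order from the BOM.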
Best
Tomas
>
> Tomas
>
>>
>> -pd
>>
>>> On 9 Sep 2024, at 01:11, Simon Urbanek
>>> <simon.urbanek using r-project.org> wrote:
>>>
>>> From the help page:
>>>
>>> The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>>> as they are appropriate values for Windows ‘Unicode’ text files.
>>> If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>>> are removed as some implementations of ‘iconv’ do not accept BOMs.
>>>
>>> so "UTF-16LE" is the documented way to reliably read such files.
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>>
>>>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.duncan using gmail.com>
>>>> wrote:
>>>>
>>>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>>>
>>>> On R-help there's a thread about reading a remote file that is
>>>> coded in UTF-16LE with a byte-order mark. Jeff Newmiller pointed out
>>>> (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html)
>>>> that it would be better to declare the encoding as "UTF-16",
>>>> because the BOM will indicate little endian.
>>>>
>>>> I tried this on my Mac running R 4.4.1, and it didn't work. I get
>>>> the same incorrect result from all of these commands:
>>>>
>>>> # Automatically recognizing a URL and using fileEncoding:
>>>> read.delim(
>>>>   'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>   fileEncoding = "UTF-16"
>>>> )
>>>>
>>>> # Using explicit url() with encoding:
>>>> read.delim(
>>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>       encoding = "UTF-16")
>>>> )
>>>>
>>>> # Specifying the endianness incorrectly:
>>>> read.delim(
>>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>       encoding = "UTF-16BE")
>>>> )
>>>>
>>>> The only way I get the correct result is if I specify "UTF-16LE"
>>>> explicitly, whereas Jeff got correct results on several different
>>>> systems using "UTF-16".
>>>>
>>>> Is this a macOS bug or an R for macOS bug?
>>>>
>>>> Duncan Murdoch
>>>>
>>>> _______________________________________________
>>>> R-SIG-Mac mailing list
>>>> R-SIG-Mac using r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>>