[R-pkg-devel] handling of byte-order-mark on r-devel-linux-x86_64-debian-clang machine

Wed Apr 6 00:02:41 CEST 2022

Thanks to the ubiquity of Excel and its misguided inclusion of BOM codes in its UTF-8 CSV format, this optimism about encoding being a corner case seems premature. There are actually multiple options in Excel for writing CSV files, and only one of them (not the first one fortunately) has this "feature", but I (and various beginners I end up helping) seem to encounter these silly files far more frequently than seems reasonable.
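
For anyone who runs into such a file, here is a minimal sketch of one way to read it in R. It assumes the file really is UTF-8 with a leading BOM; the temporary file and its contents are made up for illustration. It relies on the "UTF-8-BOM" encoding name described in ?file, which asks the connection to drop a leading BOM on input.

```r
# Hypothetical example file: the first three bytes are the UTF-8 BOM
# (EF BB BF), followed by a small made-up CSV payload.
path <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("id,value\n1,10\n2,20\n")),
  path
)

# Declaring the input encoding as "UTF-8-BOM" lets the connection remove
# the BOM before the text reaches the CSV parser, so the first column
# name is not polluted by it.
clean <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(names(clean))

# The same encoding name can be given to an explicit connection.
# The connection is opened here, so it has to be closed by hand.
con <- file(path, open = "rt", encoding = "UTF-8-BOM")
header <- readLines(con, n = 1)
close(con)
print(header)

unlink(path)
```

Whether this is preferable to fixing the export step in Excel depends on who controls the files; the sketch only covers the reading side.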

On April 5, 2022 11:20:37 AM PDT, Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>
>On 3/28/22 13:16, Ivan Krylov wrote:
>> On Mon, 28 Mar 2022 09:54:57 +0200
>> Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>>
>>> Could you please clarify which part you found somewhat confusing,
>>> could that be improved?
>> Perhaps "somewhat confusing" is an overstatement, sorry about that. All
>> the information is already there in both ?file and ?readLines, it just
>> requires a bit of thought to understand it.
>>
>>>> When reading from a text connection, the connections code, after
>>>> re-encoding based on the ‘encoding’ argument, returns text that is
>>>> assumed to be in native encoding; an encoding mark is only added by
>>>> functions that read from the connection, so e.g.  ‘readLines’ can
>>>> be instructed to mark the text as ‘"UTF-8"’ or ‘"latin1"’, but
>>>> ‘readLines’ does no further conversion.  To allow reading text in
>>>> ‘"UTF-8"’ on a system that cannot represent all such characters in
>>>> native encoding (currently only Windows), a connection can be
>>>> internally configured to return the read text in UTF-8 even though
>>>> it is not the native encoding; currently ‘readLines’ and ‘scan’ use
>>>> this feature when given a connection that is not yet open and, when
>>>> using the feature, they unconditionally mark the text as ‘"UTF-8"’.
>> The paragraph starts by telling the user that the text is decoded into
>> the native encoding, then discusses marking the encoding (which is
>> counter-productive when decoding arbitrarily-encoded text into native
>> encoding) and only then presents the exception to the native encoding
>> output rule (decoding into UTF-8). If I'm trying to read a
>> CP1252-encoded file on a Windows 7 machine with CP1251 as the session
>> encoding, I might get confused by the mention of encoding mark between
>> the parts that are important to me.
>>
>> It could be an improvement to mention that exception closer to the
>> first point of the paragraph and, perhaps, to split the "encoding mark"
>> part from the "text connection decoding" part:
>>
>>>> Functions that read from the connection can add an encoding mark
>>>> to the returned text. For example, ‘readLines’ can be instructed
>>>> to mark the text as ‘"UTF-8"’ or ‘"latin1"’, but does no further
>>>> conversion.
>>>>
>>>> When given a connection that is not yet open and has a non-default
>>>> ‘encoding’ argument, ‘readLines’ and ‘scan’ internally configure the
>>>> connection to read text in UTF-8. Otherwise, the text after decoding
>>>> is assumed to be in native encoding.
>> (Maybe this is omitting too much and should be expanded.)
>>
>> It could also be helpful to mention the fact that the encoding argument
>> to readLines() can be ignored right in the description of that
>> argument, inviting the user to read the Details section for more
>> information.
>
>Thanks for the suggestions, I've rewritten the paragraphs, biasing 
>towards users who have UTF-8 as the native encoding as this is going to 
>be the majority. These users should not have to worry much about the 
>encoding marks anymore, nor about the internal UTF-8 mode of the 
>connections code. But I think the level of detail needs to remain as 
>long as these features are supported; it is based on numerous 
>questions and bug reports.
>
>Best
>Tomas

-- 
Sent from my phone. Please excuse my brevity.