[R-pkg-devel] handling of byte-order-mark on r-devel-linux-x86_64-debian-clang machine

Mon Mar 28 13:16:10 CEST 2022

On Mon, 28 Mar 2022 09:54:57 +0200
Tomas Kalibera <tomas.kalibera using gmail.com> wrote:

> Could you please clarify which part you found somewhat confusing,
> could that be improved?

Perhaps "somewhat confusing" is an overstatement, sorry about that. All
the information is already there in both ?file and ?readLines, it just
requires a bit of thought to understand it.

>> When reading from a text connection, the connections code, after
>> re-encoding based on the ‘encoding’ argument, returns text that is
>> assumed to be in native encoding; an encoding mark is only added by
>> functions that read from the connection, so e.g.  ‘readLines’ can
>> be instructed to mark the text as ‘"UTF-8"’ or ‘"latin1"’, but
>> ‘readLines’ does no further conversion.  To allow reading text in
>> ‘"UTF-8"’ on a system that cannot represent all such characters in
>> native encoding (currently only Windows), a connection can be
>> internally configured to return the read text in UTF-8 even though
>> it is not the native encoding; currently ‘readLines’ and ‘scan’ use
>> this feature when given a connection that is not yet open and, when
>> using the feature, they unconditionally mark the text as ‘"UTF-8"’.

The paragraph starts by telling the user that the text is decoded into
the native encoding, then talks about marking the encoding (which is
counter-productive when decoding arbitrarily-encoded text into native
encoding) and only then presents the exception to the native encoding
output rule (decoding into UTF-8). If I'm trying to read a
CP1252-encoded file on a Windows 7 machine with CP1251 as the session
encoding, I might get confused by the mention of the encoding mark
between the parts that are important to me.
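
To make that scenario concrete, here is a minimal sketch of what I would
be trying to do. The file contents are made up for illustration, and it
assumes that the iconv implementation in use recognises the name
"CP1252"; the result is converted with enc2utf8() so that the check does
not depend on the session encoding.

```r
# Create a small CP1252-encoded file: byte 0xE9 is "e with acute" there.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xE9, 0x0A)), path)  # "caf<E9>\n"

# The decoding is requested on the connection, not on readLines():
# 'encoding' here names the encoding of the file on disk.
con <- file(path, encoding = "CP1252")  # deliberately not yet open
x <- readLines(con)
close(con)
unlink(path)

# Inspect the code points after normalising to UTF-8.
utf8ToInt(enc2utf8(x))
```

The part that matters here is the 'encoding' argument of file(); the
encoding mark on the result only becomes relevant afterwards.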

It could be an improvement to mention that exception closer to the
first point of the paragraph and, perhaps, to split the "encoding mark"
part from the "text connection decoding" part:

>> Functions that read from the connection can add an encoding mark
>> to the returned text. For example, ‘readLines’ can be instructed
>> to mark the text as ‘"UTF-8"’ or ‘"latin1"’, but does no further
>> conversion.
>>
>> When given a connection that is not yet open and has a non-default
>> ‘encoding’ argument, ‘readLines’ and ‘scan’ internally configure the
>> connection to read text in UTF-8. Otherwise, the text after decoding
>> is assumed to be in native encoding.

(Maybe this is omitting too much and should be expanded.)
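
The "marks but does not convert" half of the proposal can be checked
with a short sketch like the one below. It is only an illustration under
the assumption that the file is read by name, so that readLines() opens
the connection itself with the default encoding and no re-encoding takes
place; the file contents are again made up.

```r
# The same word, this time stored as UTF-8 (0xC3 0xA9 is "e with acute").
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xC3, 0xA9, 0x0A)), path)

# The 'encoding' argument of readLines() only declares what the bytes
# are supposed to be; it does not re-encode them.
y <- readLines(path, encoding = "UTF-8")
z <- readLines(path, encoding = "latin1")
unlink(path)

Encoding(y)                            # mark requested for y
Encoding(z)                            # mark requested for z
identical(charToRaw(y), charToRaw(z))  # same bytes under both marks
```

Marking UTF-8 bytes as "latin1" is of course wrong for this file; the
point is only that readLines() takes the user's word for it.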

It could also be helpful to mention, right in the description of the
encoding argument to readLines(), that the argument can be ignored, and
to invite the user to read the Details section for more information.

-- 
Best regards,
Ivan