[R-pkg-devel] handling of byte-order-mark on r-devel-linux-x86_64-debian-clang machine

Tue Apr 5 20:20:37 CEST 2022

On 3/28/22 13:16, Ivan Krylov wrote:
> On Mon, 28 Mar 2022 09:54:57 +0200
> Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>
>> Could you please clarify which part you found somewhat confusing,
>> could that be improved?
> Perhaps "somewhat confusing" is an overstatement, sorry about that. All
> the information is already there in both ?file and ?readLines, it just
> requires a bit of thought to understand it.
>
>>> When reading from a text connection, the connections code, after
>>> re-encoding based on the ‘encoding’ argument, returns text that is
>>> assumed to be in native encoding; an encoding mark is only added by
>>> functions that read from the connection, so e.g.  ‘readLines’ can
>>> be instructed to mark the text as ‘"UTF-8"’ or ‘"latin1"’, but
>>> ‘readLines’ does no further conversion.  To allow reading text in
>>> ‘"UTF-8"’ on a system that cannot represent all such characters in
>>> native encoding (currently only Windows), a connection can be
>>> internally configured to return the read text in UTF-8 even though
>>> it is not the native encoding; currently ‘readLines’ and ‘scan’ use
>>> this feature when given a connection that is not yet open and, when
>>> using the feature, they unconditionally mark the text as ‘"UTF-8"’.
> The paragraph starts by telling the user that the text is decoded into
> the native encoding, then tells about marking the encoding (which is
> counter-productive when decoding arbitrarily-encoded text into native
> encoding) and only then presents the exception to the native encoding
> output rule (decoding into UTF-8). If I'm trying to read a
> CP1252-encoded file on a Windows 7 machine with CP1251 as the session
> encoding, I might get confused by the mention of encoding mark between
> the parts that are important to me.
>
> It could be an improvement to mention that exception closer to the
> first point of the paragraph and, perhaps, to split the "encoding mark"
> part from the "text connection decoding" part:
>
>>> Functions that read from the connection can add an encoding mark
>>> to the returned text. For example, ‘readLines’ can be instructed
>>> to mark the text as ‘"UTF-8"’ or ‘"latin1"’, but does no further
>>> conversion.
>>>
>>> When given a connection that is not yet open and has a non-default
>>> ‘encoding’ argument, ‘readLines’ and ‘scan’ internally configure the
>>> connection to read text in UTF-8. Otherwise, the text after decoding
>>> is assumed to be in native encoding.
> (Maybe this is omitting too much and should be expanded.)
>
> It could also be helpful to mention, right in the description of the
> encoding argument to readLines(), that the argument can be ignored,
> inviting the user to read the Details section for more information.
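
The CP1252-file-on-a-CP1251-session case mentioned above can be sketched
outside R. The snippet below is only an illustration in Python, not how the
R connections code is implemented; the sample word and the helper name
representable() are assumptions made up for this sketch. It shows why
decoding into the native encoding can fail and why an internal UTF-8 mode
avoids the problem.

```python
# Illustration only: sample text, codec names and the helper are assumptions
# for this sketch; this is not R's connections code.

# Bytes as they would be stored in a CP1252-encoded file.
raw = "café".encode("cp1252")

# Decoding with the file's real encoding recovers the intended text.
text = raw.decode("cp1252")

# Assuming the bytes are already in the session encoding (CP1251)
# silently produces a different character instead of an error.
misread = raw.decode("cp1251")


def representable(s, codec):
    """Return True if every character of s can be encoded in codec."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The decoded text cannot be converted to the CP1251 "native" encoding,
# but it can always be held as UTF-8.
fits_native = representable(text, "cp1251")
fits_utf8 = representable(text, "utf-8")
```

Here misread differs from text although no error was raised, and
fits_native is false while fits_utf8 is true, which is the situation the
internal UTF-8 mode is meant to handle.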

Thanks for the suggestions. I've rewritten the paragraphs, biasing them
towards users who have UTF-8 as the native encoding, as they are going to
be the majority. These users should no longer have to worry much about
encoding marks or about the internal UTF-8 mode of the connections code.
However, I think the level of detail needs to remain for as long as these
features are supported: it is based on numerous questions and bug reports.

Best
Tomas