[R-pkg-devel] handling of byte-order-mark on r-devel-linux-x86_64-debian-clang machine

Tomas Kalibera
Mon Mar 28 09:54:57 CEST 2022


On 3/26/22 14:58, Ivan Krylov wrote:
> On Sat, 26 Mar 2022 11:34:00 +0000
> Daniel Kelley <Dan.Kelley using Dal.Ca> wrote:
>
>> This file starts with a byte-order-mark, and this is skipped over on
>> all but the r-devel-linux-x86_64-debian-clang machine
> Could you please explain how you came to this conclusion? I don't have
> much experience with testthat, but looking at the recent results at
> <https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-clang/oce-00check.html>,
> it seems that the whole header is mis-decoded as latin-1, not just the
> BOM included in one field name.
>
> Please correct me if my understanding is wrong, but I'm seeing you use
> readLines to read the header of the file:
>
>>> # encoding defaults to "UTF-8-BOM"
>>> text <- readLines(file, 1, encoding=encoding, warn=FALSE)
> The `encoding` argument of readLines() is documented as follows:
>
>>> encoding: encoding to be assumed for input strings.  It is used to
>>> mark character strings as known to be in Latin-1 or UTF-8: it is
>>> not used to re-encode the input. To do the latter, specify the
>>> encoding as part of the connection ‘con’ or via
>>> ‘options(encoding=)’: see the examples.
> It's unfortunate that you lack a clean way of reproducing the problem
> (shouldn't it consistently fail on all glibc/libiconv/??? versions?),
> but I think that the right thing to do here is to use
>
> readLines(file(file, encoding = encoding), ...)
>
> ...and not the `encoding` argument of readLines. (See also: somewhat
> confusing "Encoding" section in ?file.)

Thanks, yes, that is the correct reading of the documentation. Could you 
please clarify which part you found somewhat confusing, so that it could 
be improved?
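
For illustration, here is a minimal sketch of the difference between the two
calls. The temporary file and its header text are made up for the example
(they are not taken from the oce package), and the content is kept ASCII
apart from the BOM so that the sketch does not depend on the session charset.

```r
# Hypothetical sample file: a UTF-8 byte-order mark (EF BB BF) followed by
# an ASCII header line.
path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("temperature,salinity\n")), path)

# The `encoding` argument of readLines() only marks strings as Latin-1 or
# UTF-8. "UTF-8-BOM" is neither, so the bytes come through unchanged,
# BOM included.
naive <- readLines(path, n = 1, encoding = "UTF-8-BOM", warn = FALSE)

# Specifying the encoding on the connection re-encodes the input and
# removes the BOM.
con <- file(path, encoding = "UTF-8-BOM")
header <- readLines(con, n = 1, warn = FALSE)
close(con)

# Number of extra bytes left at the start of the line by the first call.
length(charToRaw(naive)) - length(charToRaw(header))

unlink(path)
```

With non-ASCII content in the header, the first call would additionally
leave the text to be interpreted in the native encoding, which is where
the session charset starts to matter.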

>
> Taking another look at the check log, I see:
>
>>> using session charset: ISO8859-15
> Since readLines() seems to return text with Encoding(.) ==
> 'unknown' (i.e. native encoding) when it doesn't recognise its
> `encoding` argument, I guess what happens here is that the UTF-8 text is
> interpreted as ISO8859-15, and the same thing used to happen on
> Windows, where the native encoding is the current ANSI code page. This
> gives me a reason to hope that the test will start passing on Windows
> too once you apply the fix.

Yes. The encodings are also shown on the check flavors page:
https://cran.r-project.org/web/checks/check_flavors.html

R-devel and (to-be) R 4.2 on Windows use UTF-8 as the native encoding on 
recent Windows systems, but older versions of R use other encodings 
(CP1252 on the CRAN check machine).
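
As a rough sketch, the native encoding of a session can be inspected from
within R, and the effect of reading UTF-8 bytes in a single-byte native
encoding can be simulated without changing the locale. Latin-1 is used
below only as a stand-in for ISO8859-15 (an assumption made for
portability of the example; the two encodings agree on the three bytes
involved).

```r
# What R considers the native encoding of this session.
info <- l10n_info()
info[["UTF-8"]]           # TRUE when the native encoding is UTF-8
info[["Latin-1"]]         # TRUE when it is Latin-1
utils::localeToCharset()  # best guess at the charset name(s)

# The three bytes of a UTF-8 byte-order mark, as an unmarked string.
bom <- rawToChar(as.raw(c(0xEF, 0xBB, 0xBF)))

# Interpreting those bytes as Latin-1 yields three separate characters,
# which is what a single-byte native encoding makes of the BOM.
misread <- iconv(bom, from = "latin1", to = "UTF-8")
nchar(misread, type = "chars")
```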

Best
Tomas

