[R-pkg-devel] Intrinsic UTF-8 use in an aspiring CRAN package

Schuhmacher, Dominic  dominic.schuhmacher at mathematik.uni-goettingen.de
Wed May 17 14:05:49 CEST 2023


Dear list,

I have a package 
https://github.com/dschuhmacher/kanjistat
whose very purpose depends on working with Japanese kanji characters (in UTF-8 encoding). Such characters are essential in the data sets, the examples, the tests, the vignette, and the .Rd files.

My package checks fine with devtools::check() on my system and via the GitHub Actions workflow produced by usethis::use_github_action_check_standard().
However, I would like to release the package on CRAN, and running R CMD check --as-cran gives me a number of headaches. They are mainly related to the production of PDF documents via LaTeX, since it seems to be not so easy to convince LaTeX to typeset Japanese; see https://www.overleaf.com/learn/latex/Japanese

For the vignette, I can set in the R Markdown file
  pdf_document:
    latex_engine: lualatex
    includes:
      in_header: preamble.tex
and in the file preamble.tex
\usepackage{luatexja}
\usepackage{microtype}
This gives me a PDF vignette that looks and checks fine (except that the above-mentioned GitHub Actions runners don't seem to find lualatex, which is why the PDF output is commented out in the main branch on GitHub).
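A possible fix for the CI part, assuming the workflow sets up TinyTeX rather than a full TeX Live (e.g. via the r-lib/actions/setup-tinytex action, which is not part of my current workflow), would be to install the missing TeX Live packages from R in a step before building the vignette. This is only a sketch that I have not yet run on the runners:

  # sketch, assuming TinyTeX is already installed on the runner
  # (e.g. via the r-lib/actions/setup-tinytex action)
  tinytex::tlmgr_install(c("luatexja", "microtype"))
  # tlmgr resolves dependencies, but Japanese fonts may still need to be
  # added explicitly
  Sys.which("lualatex")  # sanity check: lualatex should now be on the PATH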

Unfortunately, I have not found a similar solution for the PDF manual. R CMD check yields
--------------
checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
LaTeX errors found:
! Package inputenc Error: Unicode character 冷 (U+51B7)
(inputenc) not set up for use with LaTeX.
[and many more of the same]
* checking PDF version of manual without index ... ERROR
--------------
It seems that the PDF manual is generated by first producing a LaTeX file and then running texi2dvi on it. From
https://www.gnu.org/software/texinfo/manual/texinfo/html_node/Inserting-Unicode.html
I take the message that texinfo does not handle Japanese. Is there any way to work around texi2dvi and use lualatex (with a preamble) instead? If not, is there a way to keep the UTF-8 encoded characters in the HTML help (I think this is very useful for the user!) and still produce a PDF that passes the check, e.g. by automatically replacing the kanji characters with their codepoints (or even a generic placeholder symbol) when generating the PDF manual? One idea along these lines is sketched below.
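The only direction I have found so far for that second option is the conditional text that Rd itself offers (\if, \ifelse, \out, and user-defined macros in man/macros/). I have not tested whether it actually silences the inputenc errors, so please treat the following as an unverified sketch. A hypothetical macro file man/macros/kanji.Rd could define

  % hypothetical macro: \kanji{<character>}{<codepoint>} shows the real
  % character in the HTML/text help and only the ASCII codepoint in the
  % LaTeX (PDF) manual
  \newcommand{\kanji}{\ifelse{latex}{\code{#2}}{#1}}

and the help pages would then write, e.g.,

  The character \kanji{冷}{U+51B7} means "cold".

so that the kanji itself never reaches the .tex file that texi2dvi compiles.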

Any thoughts and suggestions on this would be greatly appreciated! I hope that the remaining problems in R CMD check are then acceptable to the CRAN team, given the nature of my package. They are:

1. Examples and tests fail if the check is not run in a UTF-8 locale (a possible guard is sketched after this list).

2. checking data for non-ASCII characters ... NOTE
   Note: found 111752 marked UTF-8 strings
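
For point 1, a guard I am considering; this is a minimal sketch, assuming testthat is used for the tests (and that skipping in non-UTF-8 locales is acceptable), with a made-up example test:

  # minimal sketch: skip kanji-dependent tests when the check is not
  # run in a UTF-8 locale
  skip_if_not_utf8 <- function() {
    testthat::skip_if_not(isTRUE(l10n_info()[["UTF-8"]]),
                          message = "requires a UTF-8 locale")
  }

  test_that("kanji strings are single characters", {
    skip_if_not_utf8()
    expect_equal(nchar("\u51b7"), 1L)  # 冷, U+51B7
  })

The kanji-dependent parts of the examples could be wrapped in the same condition, i.e. inside if (l10n_info()[["UTF-8"]]) { ... }, so that they at least do not error in other locales.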

Many thanks,
Dominic Schuhmacher





