[Rd] HTML documentation check works best with Tidy >= 5.0.0

Ivan Krylov kry|ov@r00t @end|ng |rom gm@||@com
Sat Aug 5 22:52:55 CEST 2023


Hello R-devel,

Old versions of HTML Tidy cause false-positive NOTEs for the HTML
version of the manual wherever Tidy encounters HTML5 features it does
not support.

Conveniently, both HTML5 support and release version numbers officially
appeared in HTML Tidy version 5.0.0 [*]. For example, the last version
of the "old Tidy" I could find fails on an R help page:

 cvs -z3 -d:pserver:anonymous@tidy.cvs.sourceforge.net:/cvsroot/tidy \
  co -P tidy
 cd tidy
 make -C build/gmake
 bin/tidy -v
 # HTML Tidy for Linux released on 25 March 2009
 R-devel CMD Rdconv .../R-devel/src/library/stats/man/lm.Rd -t html | \
  bin/tidy >/dev/null
 # line 4 column 1 - Warning: <link> inserting "type" attribute
 # line 12 column 1 - Warning: <script> proprietary attribute "onload"
 # line 12 column 1 - Warning: <script> inserting "type" attribute
 # line 17 column 1 - Warning: <table> lacks "summary" attribute
 # line 44 column 1 - Warning: <table> lacks "summary" attribute
 # line 200 column 1 - Warning: <table> lacks "summary" attribute
 # <...>

On the other hand, the oldest released version of Tidy-HTML5 handles
Rd-produced HTML correctly:

 git clone https://github.com/htacg/tidy-html5
 cd tidy-html5
 git checkout 5.0.0
 mkdir b5.0.0
 cd b5.0.0
 cmake ..
 cmake --build .
 ./tidy -v
 # HTML Tidy for Linux version 5.0.0
 R-devel CMD Rdconv .../R-devel/src/library/stats/man/lm.Rd -t html | \
  ./tidy >/dev/null
 # Info: Document content looks like HTML5
 # No warnings or errors were found.
 # <...>

We can use this information to only use HTML Tidy versions that support
the idioms used by Rd2HTML:

--- check.R	(revision 84834)
+++ check.R	(working copy)
@@ -5040,7 +5040,7 @@
 
         t1 <- proc.time()
         if(i1) { ## validate
-            ## require HTML Tidy, and not macOS's ancient version.
+            ## require HTML5 Tidy, and not macOS's ancient version.
             msg <- ""
             Tidy <- Sys.getenv("R_TIDYCMD", "tidy")
             OK <- nzchar(Sys.which(Tidy))
@@ -5048,10 +5048,8 @@
                 ver <- system2(Tidy, "--version", stdout = TRUE)
                 OK <- startsWith(ver, "HTML Tidy")
                 if(OK) {
-                    OK <- !grepl('Apple Inc. build 2649', ver)
-                    if(!OK) msg <- ": 'tidy' is Apple's too old build"
-                    ## Maybe we should also check version,
-                    ## but e.g. Ubuntu 16.04 does not show one.
+                    OK <- grepl('version 5\\.\\d+\\.\\d+', ver)
+                    if(!OK) msg <- ": 'tidy' does not appear to be version 5"
                 } else msg <- ": 'tidy' is not HTML Tidy"
             } else msg <- ": no command 'tidy' found"
             if(OK) {

(This is just one way to solve the problem. Instead, we could discard
versions that say "released on <date>", or try to parse the version
specification and only discard it if (a) we can't parse it or (b) it's
below 5.0.0.)
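
The second alternative could look roughly like the following. This is
only a sketch and not part of the patch: tidy_is_html5() is a
hypothetical helper name, and it assumes that the version number, when
present, is the first dotted run of digits on the first line of the
"--version" output.

```r
## Hypothetical helper, not part of check.R: decide whether the output
## of `tidy --version` describes HTML Tidy 5.0.0 or later.
tidy_is_html5 <- function(ver) {
    ver <- ver[1L]                      # only look at the first line
    if (is.na(ver) || !startsWith(ver, "HTML Tidy")) return(FALSE)
    ## Old builds print "released on <date>" and carry no dotted
    ## version number, so the match below comes back empty for them.
    m <- regmatches(ver, regexpr("[0-9]+(\\.[0-9]+)+", ver))
    ## (a) discard if we cannot parse a version at all
    if (!length(m)) return(FALSE)
    ## (b) discard if the parsed version is below 5.0.0
    numeric_version(m) >= "5.0.0"
}

tidy_is_html5("HTML Tidy for Linux version 5.0.0")
tidy_is_html5("HTML Tidy for Linux released on 25 March 2009")
```

Unlike the regular expression in the patch, this would also accept a
future major version, at the cost of a little more code.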

With the patch applied, I get:

 PATH=.../tidy/bin:"$PATH" _R_CHECK_RD_VALIDATE_RD2HTML_=TRUE \
  R-devel CMD check $package.tar.gz
 # * checking HTML version of manual ... NOTE
 # Skipping checking HTML validation: 'tidy' does not appear to be 
 # version 5

 PATH=.../tidy-html5/b5.0.0/:"$PATH" _R_CHECK_RD_VALIDATE_RD2HTML_=TRUE \
  R-devel CMD check $package.tar.gz
 # * checking HTML version of manual ... OK

-- 
Best regards,
Ivan

[*] There are commits in the tidy-html5 repo containing versions marked
4.x.x, but they aren't tagged and, as far as I know, were never
considered official releases.


