[Rd] URL checks
Kirill Müller
krlmlr+r at mailbox.org
Thu Jan 7 12:14:02 CET 2021
Hi
The URL checks in R CMD check test all links in the README and vignettes
for broken or redirected links. In many cases this improves
documentation, but I see problems with this approach, which I have
detailed below.
I'm writing to this mailing list because I think any fix needs to
happen in R's check routines. I propose to introduce an "allow-list"
for URLs, to reduce the burden on both CRAN and package maintainers.
Comments are greatly appreciated.
Best regards
Kirill
# Problems with the detection of broken/redirected URLs
## 301 should often be 307, how to change?
Many web sites use a 301 redirection code that probably should be a 307.
This matters because R CMD check flags permanent redirects (301) and
asks for the new location to be used, whereas temporary redirects such
as 307 are not treated as problems. For example, https://www.oracle.com
and https://www.oracle.com/ both redirect to
https://www.oracle.com/index.html with a 301. I suspect the company
still wants oracle.com to be recognized as the primary entry point of
their web presence (reserving the right to move the redirection to a
different location later); I haven't checked with their PR department,
though. If that's true, the redirect probably should be a 307, which
would have to be fixed by their IT department -- which I haven't
contacted yet either.
$ curl -i https://www.oracle.com
HTTP/2 301
server: AkamaiGHost
content-length: 0
location: https://www.oracle.com/index.html
...
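The status code and redirect target can also be queried more compactly;
the following uses only standard curl options, and the output reflects
the redirect described above:
$ curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' https://www.oracle.com
301 https://www.oracle.com/index.html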
## User agent detection
twitter.com responds with a 400 error to requests whose user agent
string does not indicate a supported browser.
$ curl -i https://twitter.com/
HTTP/2 400
...
<body>...<p>Please switch to a supported browser...</p>...</body>
$ curl -s -i https://twitter.com/ -A "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" | head -n 1
HTTP/2 200
# Impact
While the latter problem *could* be fixed by supplying a browser-like
user agent string, the former problem is virtually unfixable: far too
many web sites use 301 where a 307 would be correct, and package
maintainers cannot change that. The above list is also incomplete --
think of unreliable links, plain HTTP links, and other failure modes...
This affects me as a package maintainer: I have the choice to either
change the links to incorrect versions or remove them altogether.
I could also explain each broken link to CRAN, but I think this
subjects the team to undue burden. Submitting a package with NOTEs
delays the release, and I must release this package very soon to avoid
having it pulled from CRAN; I'd rather not risk that -- hence I need to
remove the links and put them back later.
I'm aware of https://github.com/r-lib/urlchecker; it alleviates the
problem but ultimately doesn't solve it.
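For reference, a minimal invocation from a shell in the package
directory (assuming the urlchecker package is installed; url_check()
checks the package in the current directory by default):
$ Rscript -e 'urlchecker::url_check()'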
# Proposed solution
## Allow-list
A file inst/URL that lists all URLs where failures are allowed --
possibly with a list of the HTTP codes accepted for that link.
Example:
https://oracle.com/ 301
https://twitter.com/drob/status/1224851726068527106 400
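To make the mechanics concrete, here is a rough sketch of how a checker
could consume such a file; the file name, the format, and the default
of accepting only 200 are assumptions of this proposal, not existing R
behavior:
while read -r url allowed; do
  # query the status code only, without following redirects
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  case " ${allowed:-200} " in
    *" $status "*) echo "ok   $url ($status)" ;;
    *)             echo "FAIL $url (got $status, allowed: ${allowed:-200})" ;;
  esac
done < inst/URL
Each line of inst/URL would hold a URL followed by the status codes
that are acceptable for it; URLs without codes would be held to the
usual standard.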