[Rd] URL checks

Kirill Müller krlmlr+ml at mailbox.org
Thu Jan 7 15:45:39 CET 2021


One other failure mode: SSL certificates that browsers trust but that 
are not installed on the check machine, e.g. the "GEANT Vereniging" 
certificate used by https://relational.fit.cvut.cz/ .
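
A minimal sketch of reproducing this failure from R, assuming the curl
package is installed (the exact error text depends on the local CA
bundle):

tryCatch(
  curl::curl_fetch_memory("https://relational.fit.cvut.cz/"),
  error = function(e) message("URL check would fail: ", conditionMessage(e))
)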


K


On 07.01.21 12:14, Kirill Müller via R-devel wrote:
> Hi
>
>
> The URL checks in R CMD check test all links in the README and 
> vignettes for broken or redirected links. While in many cases this 
> improves documentation, I see problems with this approach, which I 
> have detailed below.
>
> I'm writing to this mailing list because I think the change needs to 
> happen in R's check routines. I propose to introduce an "allow-list" 
> for URLs, to reduce the burden on both CRAN and package maintainers.
>
> Comments are greatly appreciated.
>
>
> Best regards
>
> Kirill
>
>
> # Problems with the detection of broken/redirected URLs
>
> ## 301 should often be 307, how to change?
>
> Many web sites use a 301 redirection code that probably should be a 
> 307. For example, https://www.oracle.com and https://www.oracle.com/ 
> both redirect to https://www.oracle.com/index.html with a 301. I 
> suspect the company still wants oracle.com to be recognized as the 
> primary entry point of their web presence (reserving the right to 
> move the redirection to a different location later), though I haven't 
> checked with their PR department. If that's true, the redirect 
> probably should be a 307; that would have to be fixed by their IT 
> department, which I haven't contacted yet either.
>
> $ curl -i https://www.oracle.com
> HTTP/2 301
> server: AkamaiGHost
> content-length: 0
> location: https://www.oracle.com/index.html
> ...
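>
> The same status can be observed from R, e.g. with the curl package (a
> sketch; followlocation = FALSE stops libcurl from following the
> redirect):
>
> h <- curl::new_handle(followlocation = FALSE)
> res <- curl::curl_fetch_memory("https://www.oracle.com", handle = h)
> res$status_code                                # 301
> curl::parse_headers_list(res$headers)$location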
>
> ## User agent detection
>
> twitter.com responds with a 400 error to requests whose user agent 
> string does not hint at an accepted browser.
>
> $ curl -i https://twitter.com/
> HTTP/2 400
> ...
> <body>...<p>Please switch to a supported browser...</p>...</body>
>
> $ curl -s -i https://twitter.com/ -A "Mozilla/5.0 (X11; Ubuntu; Linux 
> x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" | head -n 1
> HTTP/2 200
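>
> From R, a browser-like user agent can be supplied the same way, e.g.
> with the curl package (a sketch; with this handle the request
> succeeds with a 200):
>
> ua <- paste("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0)",
>             "Gecko/20100101 Firefox/84.0")
> h <- curl::new_handle(useragent = ua)
> curl::curl_fetch_memory("https://twitter.com/", handle = h)$status_code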
>
> # Impact
>
> While the latter problem *could* be fixed by supplying a browser-like 
> user agent string, the former problem is virtually unfixable: far too 
> many web sites use 301 where a 307 would be appropriate. The above 
> list is also incomplete -- think of unreliable links, HTTP links, and 
> other failure modes...
>
> This affects me as a package maintainer: I have the choice to either 
> change the links to incorrect versions or remove them altogether.
>
> I can also choose to explain each broken link to CRAN, but I think 
> this subjects the team to undue burden. Submitting a package with 
> NOTEs delays the release, and I must release this package very soon 
> to avoid having it pulled from CRAN. I'd rather not risk that, so I 
> need to remove the link and put it back later.
>
> I'm aware of https://github.com/r-lib/urlchecker; it alleviates the 
> problem but ultimately doesn't solve it.
>
> # Proposed solution
>
> ## Allow-list
>
> A file inst/URL that lists all URLs where failures are allowed -- 
> possibly with a list of the HTTP codes accepted for that link.
>
> Example:
>
> https://oracle.com/ 301
> https://twitter.com/drob/status/1224851726068527106 400
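>
> A hypothetical sketch of how a check routine could consume such a
> file (the file name, the one-URL-per-line format, and a single
> accepted code per line are assumptions based on the example above):
>
> allowed <- utils::read.table("inst/URL", header = FALSE, fill = TRUE,
>                              col.names = c("url", "code"),
>                              stringsAsFactors = FALSE)
> is_allowed <- function(url, status) {
>   hit <- allowed$url == url
>   any(hit & (is.na(allowed$code) | allowed$code == status))
> }
> is_allowed("https://oracle.com/", 301)  # TRUE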
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


