[Rd] URL checks
Spencer Graves
spencer.graves sending from prodsyse.com
Fri Jan 8 13:04:13 CET 2021
I also would be pleased to be allowed to provide "a list of known
false positives/exceptions" to the URL tests. I've been challenged
multiple times regarding URLs that worked fine when I checked them. We
should not be required to do a partial lobotomy to pass R CMD check ;-)
Spencer Graves
On 2021-01-07 09:53, Hugo Gruson wrote:
>
> I encountered the same issue today with https://astrostatistics.psu.edu/.
>
> This is a trust chain issue, as explained here:
> https://whatsmychaincert.com/?astrostatistics.psu.edu.
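>
> This can be reproduced from R with base R's curlGetHeaders() (a
> sketch; whether it fails depends on which intermediate certificates
> are present in the local CA bundle):
>
>   tryCatch(
>     attr(curlGetHeaders("https://astrostatistics.psu.edu/"), "status"),
>     error = conditionMessage
>   )
>
> On an affected machine this returns a libcurl certificate error (code
> 60) instead of a status code; browsers still load the page because
> they fetch or cache the missing intermediate certificate.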
>
> I've worked for a couple of years on a project to increase HTTPS
> adoption on the web and we noticed that this type of error is very
> common, and that website maintainers are often unresponsive to requests
> to fix this issue.
>
> Therefore, I totally agree with Kirill that a list of known
> false positives/exceptions would be a great addition to save time for
> both the CRAN team and package developers.
>
> Hugo
>
> On 07/01/2021 15:45, Kirill Müller via R-devel wrote:
>> One other failure mode: SSL certificates that are trusted by browsers
>> but not installed on the check machine, e.g. the "GEANT Vereniging"
>> certificate from https://relational.fit.cvut.cz/.
>>
>>
>> K
>>
>>
>> On 07.01.21 12:14, Kirill Müller via R-devel wrote:
>>> Hi
>>>
>>>
>>> The URL checks in R CMD check test all links in the README and
>>> vignettes for broken or redirected links. In many cases this improves
>>> documentation, but I see problems with this approach, which I have
>>> detailed below.
>>>
>>> I'm writing to this mailing list because I think the change needs to
>>> happen in R's check routines. I propose to introduce an "allow-list"
>>> for URLs, to reduce the burden on both CRAN and package maintainers.
>>>
>>> Comments are greatly appreciated.
>>>
>>>
>>> Best regards
>>>
>>> Kirill
>>>
>>>
>>> # Problems with the detection of broken/redirected URLs
>>>
>>> ## 301 should often be 307, how to change?
>>>
>>> Many web sites use a 301 redirection code that probably should be a
>>> 307. For example, https://www.oracle.com and https://www.oracle.com/
>>> both redirect to https://www.oracle.com/index.html with a 301. I
>>> suspect the company still wants oracle.com to be recognized as the
>>> primary entry point of their web presence (to reserve the right to
>>> move the redirection to a different location later), although I
>>> haven't checked with their PR department. If that's true, the
>>> redirect probably should be a 307 -- a fix that would fall to their
>>> IT department, which I haven't contacted yet either.
>>>
>>> $ curl -i https://www.oracle.com
>>> HTTP/2 301
>>> server: AkamaiGHost
>>> content-length: 0
>>> location: https://www.oracle.com/index.html
>>> ...
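>>>
>>> The same observation from R, as a sketch using base R's
>>> curlGetHeaders() (which, as far as I can tell, is also what the URL
>>> checks rely on):
>>>
>>> h <- curlGetHeaders("https://www.oracle.com/", redirect = FALSE)
>>> attr(h, "status")
>>> ## 301
>>> grep("^location:", h, ignore.case = TRUE, value = TRUE)
>>> ## "location: https://www.oracle.com/index.html"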
>>>
>>> ## User agent detection
>>>
>>> twitter.com responds with a 400 error to requests whose user agent
>>> string does not hint at an accepted browser.
>>>
>>> $ curl -i https://twitter.com/
>>> HTTP/2 400
>>> ...
>>> <body>...<p>Please switch to a supported browser...</p>...</body>
>>>
>>> $ curl -s -i https://twitter.com/ -A "Mozilla/5.0 (X11; Ubuntu; Linux
>>> x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" | head -n 1
>>> HTTP/2 200
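>>>
>>> From R, a browser-like user agent can be supplied via the CRAN 'curl'
>>> package (a sketch under that assumption; base curlGetHeaders() has no
>>> user-agent argument as far as I know):
>>>
>>> h <- curl::new_handle(useragent = paste(
>>>   "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0)",
>>>   "Gecko/20100101 Firefox/84.0"))
>>> curl::curl_fetch_memory("https://twitter.com/", handle = h)$status_code
>>> ## 200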
>>>
>>> # Impact
>>>
>>> While the latter problem *could* be fixed by supplying a browser-like
>>> user agent string, the former problem is virtually unfixable -- so
>>> many web sites should use 307 instead of 301 but don't. The above
>>> list is also incomplete -- think of unreliable links, HTTP links,
>>> other failure modes...
>>>
>>> This affects me as a package maintainer: I have the choice to either
>>> change the links to incorrect versions or remove them altogether.
>>>
>>> I could also explain each broken link to CRAN, but I think this puts
>>> undue burden on the team. Submitting a package with NOTEs also delays
>>> the release of a package which I must release very soon to avoid
>>> having it pulled from CRAN; I'd rather not risk that -- hence I need
>>> to remove the links and put them back later.
>>>
>>> I'm aware of https://github.com/r-lib/urlchecker; it alleviates the
>>> problem but ultimately doesn't solve it.
>>>
>>> # Proposed solution
>>>
>>> ## Allow-list
>>>
>>> A file inst/URL that lists all URLs where failures are allowed --
>>> possibly with a list of the HTTP codes accepted for that link.
>>>
>>> Example:
>>>
>>> https://oracle.com/ 301
>>> https://twitter.com/drob/status/1224851726068527106 400
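>>>
>>> As a rough illustration (purely hypothetical -- no such mechanism
>>> exists in R's check code today; the helper names and matching rule
>>> below are my own sketch), the check code could consume such a file
>>> along these lines:
>>>
>>> read_url_allow_list <- function(pkgdir) {
>>>   path <- file.path(pkgdir, "inst", "URL")
>>>   if (!file.exists(path)) return(NULL)
>>>   lines <- trimws(readLines(path))
>>>   fields <- strsplit(lines[nzchar(lines)], "[[:space:]]+")
>>>   data.frame(url  = vapply(fields, `[`, "", 1L),
>>>              code = vapply(fields, `[`, "", 2L))
>>> }
>>>
>>> ## A reported failure is then silenced when it matches an entry;
>>> ## an entry without a code would allow any failure for that URL.
>>> is_allowed <- function(url, status, allowed) {
>>>   !is.null(allowed) &&
>>>     any(allowed$url == url &
>>>           (is.na(allowed$code) | allowed$code == as.character(status)))
>>> }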
>>>