[R-pkg-devel] False negatives for CRAN url check on DOIs due to Cloudflare anti-DDoS

Matthew Kay matthew.kay at gmail.com
Wed Mar 2 21:08:39 CET 2022


Hi,

I have recently noticed an increasing number of false negatives with the
CRAN URL check on DOIs. The essence of the problem is that certain DOIs
load fine in a browser but return a 503 error to `curl -I -L`, and
therefore trigger a NOTE in R CMD check.

Two example DOIs I have run into already, which used to pass fine without
NOTEs (and still load fine in a browser):

https://doi.org/10.1111/1467-985X.00120
https://doi.org/10.1177/096228029200100102

Here are the headers of the (final, after redirects) 503 response via curl, e.g. for the first DOI above:
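
curl -I -L https://doi.org/10.1111/1467-985X.00120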

HTTP/1.1 503 Service Temporarily Unavailable
Date: Wed, 02 Mar 2022 19:19:39 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
X-Frame-Options: SAMEORIGIN
Permissions-Policy: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Set-Cookie: __cf_bm=3sNb43ySbB.qquj3NGAjnfBHzV_o5LYBOHjCtdj.HXo-1646248779-0-Aa4oCF746nYwQXhKGA2xJIJ5jvxJXbsPXtflcSqSq+AjMfQYuLyNSo3fFeCBGKEdS2rIjyDW0HkT40tB1OuDxZ2sWnUZG+y/4dpZsY54IWU4; path=/; expires=Wed, 02-Mar-22 19:49:39 GMT; domain=.onlinelibrary.wiley.com; HttpOnly; Secure; SameSite=None
Strict-Transport-Security: max-age=15552000
Server: cloudflare
CF-RAY: 6e5c7c376d11c518-ORD


We can see the response is coming from Cloudflare, which runs an anti-DDoS
service. There is more discussion of that here
<https://superuser.com/questions/888507/problems-with-wget-to-a-cloudflare-hosted-site-503-service-unavailable>,
but the upshot seems to be that, without a JavaScript-enabled browser, this
is probably not a solvable problem.

Anecdotally, this seems like a problem that could get worse: a few months
ago I had one false-negative DOI in a package submission; more recently
(about a week ago) I had two. I have heard from other authors encountering
this problem lately as well.

Strategically, this gives package authors an incentive to cite sources less
accurately in order to avoid spurious NOTEs (a whitelisted package without
NOTEs can bypass human inspection and get to CRAN faster).

Perhaps a solution would be to use a dedicated API to check DOIs. CrossRef
offers one such API, though it only works for CrossRef DOIs:
https://api.crossref.org/swagger-ui/index.html

For example, applying the API to one of the failing DOIs from above returns
200 OK:

curl -I -L https://api.crossref.org/works/10.1177/096228029200100102/agency
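
For what it's worth, such a check is easy to sketch in R with the curl
package (just an illustration, not tested at scale; the helper name
doi_registered is mine):

# Sketch: ask the CrossRef agency endpoint about a DOI; it appears to
# return 200 for registered DOIs and 404 for unknown ones.
doi_registered <- function(doi) {
  url <- paste0("https://api.crossref.org/works/", doi, "/agency")
  h <- curl::new_handle(nobody = TRUE)  # HEAD request, like `curl -I`
  resp <- curl::curl_fetch_memory(url, handle = h)
  resp$status_code == 200
}

doi_registered("10.1177/096228029200100102")  # TRUE for the DOI above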

Another option might be to look only at the first response to the DOI
request (i.e., not follow redirects: `curl -I` rather than `curl -I -L`). I
*think* this should return 302 for valid DOIs and 404 for invalid DOIs. For
example, this returns 302 Found:

curl -I https://doi.org/10.1111/1467-985X.00120

And this non-existent DOI returns 404:

curl -I https://doi.org/10.1111/1467-985X.XXXXXX

So perhaps just using `curl -I` to check DOI URLs would solve the problem?
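
In R, that check might look something like the following (again just a
sketch using the curl package; the helper name doi_status is mine, and this
assumes doi.org keeps returning 302/404 on the first hop):

# Sketch: inspect only the first response from doi.org, without
# following redirects (the equivalent of `curl -I` without `-L`).
doi_status <- function(url) {
  h <- curl::new_handle(nobody = TRUE, followlocation = FALSE)
  curl::curl_fetch_memory(url, handle = h)$status_code
}

doi_status("https://doi.org/10.1111/1467-985X.00120")   # 302 (valid DOI)
doi_status("https://doi.org/10.1111/1467-985X.XXXXXX")  # 404 (invalid DOI)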

Thanks,

---Matt


-- 
As an adherent of Email Friday
<https://mjskay.medium.com/the-doctrine-of-email-friday-add7f8332d80>, if
your email is not urgent I likely won't reply until the next Friday.
--
Matthew Kay
Assistant Professor
Northwestern University Computer Science and Communication Studies
mjskay at northwestern.edu
http://www.mjskay.com/



