[R-pkg-devel] False positives for CRAN url check on DOIs due to Cloudflare anti-DDoS

Martin Maechler maechler at stat.math.ethz.ch
Thu Mar 3 08:53:44 CET 2022


[changed subject:  s/negative/positive/ ! ]
>>>>> Matthew Kay 
>>>>>     on Wed, 2 Mar 2022 14:08:39 -0600 writes:

    > Hi,

    > I have recently noticed an increasing number of false
    > negatives with the CRAN url check on DOIs. The essence of
    > the problem is that certain DOIs will load fine in the
    > browser, but return a 503 error to `curl -I -L` and
    > therefore trigger a NOTE in R CMD check.

    > Two example DOIs I have run into already, which used to
    > pass fine without NOTEs (and still load fine in a
    > browser):

    > https://doi.org/10.1111/1467-985X.00120
    > https://doi.org/10.1177/096228029200100102
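
A quick way to reproduce this from R itself, assuming base R's
curlGetHeaders() is representative of what the URL check does (it
follows redirects and sends a HEAD-style request):

    ## assumes curlGetHeaders() mirrors the check's behaviour
    h <- curlGetHeaders("https://doi.org/10.1111/1467-985X.00120",
                        redirect = TRUE)
    attr(h, "status")  ## 503 here, although a browser loads the page fine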

This occurrence of false *positives* (not "negatives"), FPs for
short, in URL checks has been happening for a long time (more than
a year) and is a bit unfortunate.
As you allude to, they come from web servers whose maintainers do
not want non-interactive "mass downloads", web scraping, etc. to
happen.

I agree that it's particularly unpleasant in the case of DOIs;
OTOH, we can live with it, and I think the CRAN and Bioconductor
teams are well aware of the issue.

I also don't think it's worth starting a "fight" with the webservice
(trying to camouflage our URL checking as coming from an
 interactive web browser), but maybe we (R Core in conjunction
with the CRAN team) should introduce a special 'URLNOTE' for the
URL checks and be aware that these may be FPs.

Martin

    > Here are the headers of the (final, after redirects) 503
    > response via curl:

    > HTTP/1.1 503 Service Temporarily Unavailable
    > Date: Wed, 02 Mar 2022 19:19:39 GMT
    > Content-Type: text/html; charset=UTF-8
    > Connection: close
    > X-Frame-Options: SAMEORIGIN
    > Permissions-Policy: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
    > Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    > Expires: Thu, 01 Jan 1970 00:00:01 GMT
    > Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    > Set-Cookie: __cf_bm=3sNb43ySbB.qquj3NGAjnfBHzV_o5LYBOHjCtdj.HXo-1646248779-0-Aa4oCF746nYwQXhKGA2xJIJ5jvxJXbsPXtflcSqSq+AjMfQYuLyNSo3fFeCBGKEdS2rIjyDW0HkT40tB1OuDxZ2sWnUZG+y/4dpZsY54IWU4; path=/; expires=Wed, 02-Mar-22 19:49:39 GMT; domain=.onlinelibrary.wiley.com; HttpOnly; Secure; SameSite=None
    > Strict-Transport-Security: max-age=15552000
    > Server: cloudflare
    > CF-RAY: 6e5c7c376d11c518-ORD


    > We can see the response is coming from Cloudflare, which
    > runs an anti-DDoS service. There is more discussion of that here
    > <https://superuser.com/questions/888507/problems-with-wget-to-a-cloudflare-hosted-site-503-service-unavailable>,
    > but the upshot seems to be that without a JavaScript-enabled
    > browser this is probably not a solvable problem.

    > Anecdotally, this seems like a problem that could get
    > worse: a few months ago I had one false negative DOI in a
    > package submission, more recently (about a week ago) I had
    > two. I have heard from other authors encountering this
    > problem lately as well.

    > Strategically, this disincentivizes package authors from
    > accurately citing sources, in order to avoid spurious NOTEs
    > (without which a whitelisted package can bypass human
    > inspection and get to CRAN faster).

    > Perhaps a solution would be to use a dedicated API to
    > check DOIs. CrossRef offers one such API, though it only
    > works for CrossRef-registered DOIs:
    > https://api.crossref.org/swagger-ui/index.html

    > For example, applying the API to one of the failing DOIs
    > from above returns 200 OK:

    > curl -I -L
    > https://api.crossref.org/works/10.1177/096228029200100102/agency
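
A rough sketch of such a check from R; the doi_ok() helper here is
hypothetical, and it assumes base R's curlGetHeaders() would be used
for the request:

    ## hypothetical helper: validate a DOI via the CrossRef REST API
    ## (only meaningful for CrossRef-registered DOIs)
    doi_ok <- function(doi) {
      u <- paste0("https://api.crossref.org/works/", doi, "/agency")
      identical(attr(curlGetHeaders(u), "status"), 200L)
    }
    doi_ok("10.1177/096228029200100102")  ## TRUE: the API returns 200 OK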

    > Another option might be to just look at the first response
    > to the DOI request (i.e., not to follow redirects; just
    > curl -I not curl -I -L). I *think* this should return 302
    > on valid DOIs and 404 on invalid DOIs. For example, this
    > returns 302 Found:

    > curl -I https://doi.org/10.1111/1467-985X.00120

    > And this non-existent DOI returns 404:

    > curl -I https://doi.org/10.1111/1467-985X.XXXXXX

    > So perhaps just using curl -I to check DOI URLs would
    > solve the problem?
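
A sketch of that variant; doi_resolves() is a hypothetical helper,
and the set of redirect codes accepted as "valid" is an assumption:

    ## hypothetical helper: look only at the *first* response from
    ## doi.org, i.e. do not follow redirects (curl -I, not curl -I -L)
    doi_resolves <- function(doi) {
      h <- curlGetHeaders(paste0("https://doi.org/", doi),
                          redirect = FALSE)
      attr(h, "status") %in% c(301L, 302L, 303L, 307L, 308L)
    }
    doi_resolves("10.1111/1467-985X.00120")   ## TRUE  (302 Found)
    doi_resolves("10.1111/1467-985X.XXXXXX")  ## FALSE (404 Not Found)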

    > Thanks,

    > ---Matt


    > -- 
    > As an adherent of Email Friday
    > <https://mjskay.medium.com/the-doctrine-of-email-friday-add7f8332d80>,
    > if your email is not urgent I likely won't reply until the
    > next Friday.
    > --
    > Matthew Kay
    > Assistant Professor
    > Northwestern University, Computer Science and Communication Studies
    > mjskay at northwestern.edu
    > http://www.mjskay.com/




