[R-pkg-devel] False positives for CRAN url check on DOIs due to Cloudflare anti-DDoS

Matthew Kay matthew.kay at gmail.com
Thu Mar 3 09:35:09 CET 2022


On Thu, Mar 3, 2022, 1:53 AM Martin Maechler <maechler at stat.math.ethz.ch>
wrote:

>
>
> This occurrence of false *positives* (not "negatives"),


Ah, I took the glass-half-full view that the purpose of the check is to
verify URLs are correct rather than detect incorrect ones ;).

> FPs for short, in url checks has been happening for a long time (more
> than a year) and is a bit unfortunate.
> As you allude to, it's from webservers where the maintainers do not want
> non-interactive "mass download" or web scraping etc to happen.
>
> I agree that it's particularly unpleasant in the case of DOIs;
> OTOH, we can live with it, and I think the CRAN and Bioconductor
> teams are well aware of the issue.
>

Glad they're aware of it.


> I also don't think it's worth starting a "fight" with the webservice
> (trying to camouflage our url checking as coming from an
> interactive web browser), but maybe we (R Core in conjunction
> with the CRAN team) should introduce a special 'URLNOTE' for the
> URL checks, and be aware that these may be FPs.
>

Thanks, something like this would be helpful. I do wonder whether you
considered my other suggestion later in the message: not following
redirects in the URL checks just for DOIs (i.e., using curl -I for DOIs).
It is my understanding that DOI checks are already a special case compared
to normal URLs, and I think this solution would be more robust for DOIs
than the current approach. It would also lower the burden for CRAN
reviewers and package authors alike by eliminating the need for human
review in some cases.
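To illustrate the idea, here is a minimal sketch using the curl R package
(check_doi() is a hypothetical helper of mine, not anything R CMD check
actually runs):

library(curl)

# Hypothetical DOI check: issue a HEAD request (nobody = TRUE) and do not
# follow redirects (followlocation = FALSE), so we see doi.org's own
# response rather than the publisher's Cloudflare-protected landing page.
check_doi <- function(doi) {
  h <- new_handle(nobody = TRUE, followlocation = FALSE)
  res <- curl_fetch_memory(paste0("https://doi.org/", doi), handle = h)
  # A registered DOI redirects (e.g. 302); an unregistered one returns 404.
  res$status_code %in% c(301L, 302L, 303L, 307L, 308L)
}

check_doi("10.1111/1467-985X.00120")  # TRUE:  doi.org answers 302 Found
check_doi("10.1111/1467-985X.XXXXXX") # FALSE: doi.org answers 404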

Best,

---Matt



> Martin
>
>     > Here are the headers of the (final, after redirects) 503
>     > response via curl:
>
>     > HTTP/1.1 503 Service Temporarily Unavailable
>     > Date: Wed, 02 Mar 2022 19:19:39 GMT
>     > Content-Type: text/html; charset=UTF-8
>     > Connection: close
>     > X-Frame-Options: SAMEORIGIN
>     > Permissions-Policy: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
>     > Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
>     > Expires: Thu, 01 Jan 1970 00:00:01 GMT
>     > Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
>     > Set-Cookie: __cf_bm=3sNb43ySbB.qquj3NGAjnfBHzV_o5LYBOHjCtdj.HXo-1646248779-0-Aa4oCF746nYwQXhKGA2xJIJ5jvxJXbsPXtflcSqSq+AjMfQYuLyNSo3fFeCBGKEdS2rIjyDW0HkT40tB1OuDxZ2sWnUZG+y/4dpZsY54IWU4; path=/; expires=Wed, 02-Mar-22 19:49:39 GMT; domain=.onlinelibrary.wiley.com; HttpOnly; Secure; SameSite=None
>     > Strict-Transport-Security: max-age=15552000
>     > Server: cloudflare
>     > CF-RAY: 6e5c7c376d11c518-ORD
>
>
>     > We can see the response is coming from Cloudflare, which
>     > runs an anti-DDoS service. More discussion on that here
>     > <https://superuser.com/questions/888507/problems-with-wget-to-a-cloudflare-hosted-site-503-service-unavailable>,
>     > but the upshot seems to be that without a JavaScript-enabled
>     > browser this probably is not a solvable problem.
>
>     > Anecdotally, this seems like a problem that could get
>     > worse: a few months ago I had one false negative DOI in a
>     > package submission, more recently (about a week ago) I had
>     > two. I have heard from other authors encountering this
>     > problem lately as well.
>
>     > Strategically, this disincentivizes package authors from
>     > accurately citing sources, since avoiding spurious NOTEs
>     > lets a whitelisted package bypass human inspection and get
>     > to CRAN faster.
>
>     > Perhaps a solution might be to use a dedicated API to
>     > check DOIs. CrossRef offers one such, though it only works
>     > for CrossRef DOIs:
>     > https://api.crossref.org/swagger-ui/index.html
>
>     > For example, applying the API to one of the failing DOIs
>     > from above returns 200 OK:
>
>     > curl -I -L
>     > https://api.crossref.org/works/10.1177/096228029200100102/agency
>
>     > Another option might be to just look at the first response
>     > to the DOI request (i.e., not following redirects: curl -I
>     > rather than curl -I -L). I *think* this should return 302
>     > on valid DOIs and 404 on invalid DOIs. For example, this
>     > returns 302 Found:
>
>     > curl -I https://doi.org/10.1111/1467-985X.00120
>
>     > And this non-existent DOI returns 404:
>
>     > curl -I https://doi.org/10.1111/1467-985X.XXXXXX
>
>     > So perhaps just using curl -I to check DOI urls would
>     > solve the problem?
>
>     > Thanks,
>
>     > ---Matt
>
>
>     > --
>     > As an adherent of Email Friday
>     > <https://mjskay.medium.com/the-doctrine-of-email-friday-add7f8332d80>,
>     > if your email is not urgent I likely won't reply until the
>     > next Friday.
>     > --
>     > Matthew Kay
>     > Assistant Professor
>     > Northwestern University
>     > Computer Science and Communication Studies
>     > mjskay at northwestern.edu
>     > http://www.mjskay.com/
>



