[R-pkg-devel] URL checks

Greg Hunt greg at firmansyah.com
Thu Jun 30 12:05:16 CEST 2022


Ivan,
I am sure that we can make this work a bit better, and I do agree that it
working perfectly isn't going to happen.  I don't think that the behaviour
you're seeing is likely to be stateful, i.e. the server recording the fact
that you have previously made a request from a browser.  That type of
protection is implemented for some DDoS attacks, but it soaks up resources
(money/speed) to little point when there is no DDoS or when the DDoS is too
large.  Remembering and looking up previous requests has a cost and isn't
really scalable.

I got errors for the DOI and .eu examples earlier today (though I didn't
hit the rstudio link), and I never accessed the pages using a browser.
Removing the -I from the curl request allowed them to succeed.
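
For what it's worth, the difference can be reproduced from R with the curl
package.  This is only a sketch, not the actual check code, and the URL is
a placeholder rather than one of the failing sites: the same page requested
HEAD-style via the nobody option (the equivalent of curl -I) versus a plain
GET.

library(curl)

status_for <- function(url, use_head = FALSE) {
  # nobody = TRUE makes libcurl send a HEAD request; FALSE gives a plain GET
  h <- new_handle(nobody = use_head, followlocation = FALSE)
  curl_fetch_memory(url, handle = h)$status_code
}

status_for("https://example.org/some/page", use_head = TRUE)   # HEAD-style
status_for("https://example.org/some/page", use_head = FALSE)  # plain GET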

Without the HTTP HEAD method I got 302 (redirect) responses from doi.org,
which seems to indicate that the ID exists, and 404 (not found) for an ID
which does not.  For DOI checks I suggest removing the nobody setting and
treating anything other than a 404 from the doi.org web server as success.
More precisely: regard a 302 redirect as success, a 404 as failure, and
anything else as potentially ambiguous (we would need to categorise those
as temporary or permanent, but I am not sure how much better that extra
complexity makes things).  That would be an improvement over the current
behaviour for most references to doi.org.
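
As a rough sketch of that classification (my reading of the proposal, not
CRAN's check code; the second DOI is deliberately invalid):

library(curl)

check_doi_url <- function(url) {
  # GET without the nobody option and without following the redirect
  h <- new_handle(nobody = FALSE, followlocation = FALSE)
  status <- curl_fetch_memory(url, handle = h)$status_code
  if (status >= 300 && status < 400) "ok: DOI resolves"
  else if (status == 404) "broken: DOI not found"
  else paste("ambiguous: status", status)
}

check_doi_url("https://doi.org/10.1000/182")          # the DOI Handbook, expect 302
check_doi_url("https://doi.org/10.1000/no-such-doi")  # made-up ID, expect 404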

I get the same 403 code from rstudio that you do; I suspect that they are
checking the browser in a way that doi.org doesn't.  That's probably to
protect their site text from content scraping, and getting into an arms
race with them is likely to be pointless.  Forbidden does mean that the
server is there, but we can't tell the difference between not found and any
other condition.  I'd suggest that a 403 (which means actively declined by
the web server) should be treated as success IF there is a cloudflare
Server header as well, with more CDNs added to the check over time.  You
aren't going to get access anyway.  It looks like the top three CDN vendors
are CloudFlare, Amazon and Akamai; covering those three would give you
about 90% coverage of CDN-fronted hosts, and CloudFlare is the overwhelming
market leader.
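
Something along these lines is what I have in mind for the CloudFlare case
(only a sketch, and the helper name is mine):

library(curl)

`%||%` <- function(x, y) if (is.null(x)) y else x  # fallback for a missing header

forbidden_but_behind_cloudflare <- function(url) {
  res  <- curl_fetch_memory(url, handle = new_handle(followlocation = TRUE))
  hdrs <- parse_headers_list(res$headers)  # header names come back lower-cased
  res$status_code == 403 &&
    grepl("cloudflare", hdrs[["server"]] %||% "", ignore.case = TRUE)
}

forbidden_but_behind_cloudflare(
  "https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages")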

In summary:

   - Removing nobody, which selects the HEAD method, may allow the
   composite indicators .eu sites to work, meaning that sites which have
   removed support for HEAD (not an uncommon thing to do at the prompting
   of IT auditors) will start to work.
   - Removing nobody and then not following the redirect may allow the
   doi.org requests to work.
   - A 403 code together with a cloudflare Server header in the response
   should be regarded as success; it's as positive a signal as you are
   likely to see.
   - Check what the responses from Amazon and Akamai look like to identify
   them (Amazon responses have a bunch of X-amzn-* headers in them, and an
   Akamai site I looked at included an x-akamai-transformed header in its
   response); see the sketch after this list.  I would also add logging to
   the build environment to collect the requests and response headers from
   failed requests to see what the overall behaviour is.
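
A sketch of those last two points together (the header names are the ones I
observed above; the function names and the logging format are just
illustrative):

library(curl)

`%||%` <- function(x, y) if (is.null(x)) y else x

identify_cdn <- function(hdrs) {
  nm <- tolower(names(hdrs))  # parse_headers_list() already lower-cases names
  if (grepl("cloudflare", hdrs[["server"]] %||% "", ignore.case = TRUE)) "cloudflare"
  else if (any(startsWith(nm, "x-amzn-"))) "amazon"
  else if ("x-akamai-transformed" %in% nm) "akamai"
  else NA_character_
}

log_failed_check <- function(url, logfile = "url-check-failures.csv") {
  res <- curl_fetch_memory(url, handle = new_handle(followlocation = TRUE))
  if (res$status_code >= 400) {
    hdrs <- parse_headers_list(res$headers)
    row <- data.frame(url = url,
                      status = res$status_code,
                      cdn = identify_cdn(hdrs),
                      headers = paste(names(hdrs), unlist(hdrs),
                                      sep = ": ", collapse = "; "))
    write.table(row, logfile, sep = ",", row.names = FALSE,
                col.names = !file.exists(logfile),
                append = file.exists(logfile))
  }
  invisible(res$status_code)
}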

I think it's worth exploring this to remove a bunch of the recurring
questions about URL lookup.  The question is whether the servers running
the CRAN checks see the same behaviour that I am seeing.  If we can get,
say, two thirds of the errors resolved this way then everyone is better off.

None of this will increase the traffic rate from CRAN as a result of these
checks, and frankly I doubt that you are going to generate enough traffic
to show up in anyone's traffic analysis anyway.  The hits on doi.org are
likely to be the single largest group, and doi.org clearly expect that
people will access the site from scripts, so I doubt that this will cause
more explicit blocking.  For myself, I tend to get a bit antsy about
clients that submit failed requests over and over, or ones that seem to be
systematically scraping a site (meaning many thousands of requests and/or a
very organised pattern).


Greg


On Thu, 30 Jun 2022 at 18:36, Ivan Krylov <krylov.r00t using gmail.com> wrote:

> Greg,
>
> I realise you are trying to solve the problem and I thank you for
> trying to make the URL checks better for everyone. I probably sound
> defeatist in my e-mails; sorry about that.
>
> On Thu, 30 Jun 2022 17:49:49 +1000
> Greg Hunt <greg using firmansyah.com> wrote:
>
> > Do you have evidence that even without the use of HEAD that
> > CloudFlare is rejecting the CRAN checks?
>
> Unfortunately, yes, I think it's possible:
>
> $ curl -v
> https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages
> # ...skipping TLS logs...
> > GET /hc/en-us/articles/219949047-Installing-older-versions-of-packages HTTP/2
> > Host: support.rstudio.com
> > User-Agent: curl/7.64.0
> > Accept: */*
> >
> * Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
> < HTTP/2 403
> < date: Thu, 30 Jun 2022 08:13:01 GMT
>
> CloudFlare blocks are probabilistic. I *think* the reason I got a 403
> is because I didn't visit the page with my browser first. Switching
> from HEAD to GET might also increase the traffic flow, leading to more
> blocks from hosts not previously blocking the HEAD requests.
>
> CloudFlare's suggested solution would be Private Access Tokens [*], but
> that looks hard to implement (who would agree to sign those tokens?)
> and leaves other CDNs.
>
> > The CDN rejecting requests or flagging the service as temporarily
> > unavailable when there is a communication failure with its upstream
> > server is much the same behaviour that you would expect to see from
> > the usual kinds of protection that you'd apply to a web server (some
> > kind of filter/proxy/firewall) even without a CDN in place.
>
> My point was different. If the upstream is actually down, the page
> can't be served even to "valid" users, and the 503 error from
> CloudFlare should fail the URL check. On the other hand, if the 503
> error is due to the check tripping a bot detector, it could be
> reasonable to give that page a free pass.
>
> How can we distinguish those two situations? Could CloudFlare ask for a
> CAPTCHA first, then realise that the upstream is down and return
> another 503?
>
> Yes, this is a sensitivity vs specificity question, and we can trade
> some false positives (that we get now) for some false negatives
> (letting a legitimate error status from a CDN pass the test) to make
> life easier for package maintainers. Your suggestions are a step in the
> right direction, but there has to be a way to make them less fragile.
>
> --
> Best regards,
> Ivan
>
> [*]
>
> https://blog.cloudflare.com/eliminating-captchas-on-iphones-and-macs-using-new-standard/
>
