[R-pkg-devel] Automated checking defeated by website anti-scraping rules
Hugh Parsonage
hugh.parsonage sending from gmail.com
Sat Jun 14 02:33:55 CEST 2025
When checking a package on win-devel, I get the NOTE:

Found the following (possibly) invalid URLs:
  URL: http://classic.austlii.edu.au/au/legis/cth/consol_act/itaa1997240/s4.10.html
    From: man/small_business_tax_offset.Rd
    Status: 410
    Message: Gone
  URL: http://classic.austlii.edu.au/au/legis/cth/consol_act/mla1986131/
    From: man/medicare_levy.Rd
    Status: 410
    Message: Gone
  URL: https://guides.dss.gov.au/social-security-guide/3/4/1/10
    From: man/age_pension_age.Rd
    Status: 403
    Message: Forbidden
The URLs exist (switching to https:// changes nothing) and are
perfectly accessible from a browser. They appear to return those HTTP
statuses only because the servers choose to block 'automated
requests'. As imbecilic as these rules might be (they can probably be
defeated easily enough), what should the policy be going forward? I
can wrap these URLs in \code{} to get past the checks, but a better
solution might be available at the check stage.
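To illustrate the point, here is a rough sketch (using the curl
package; the browser-like User-Agent string is only an example, and
this is not how win-builder itself performs the check) that compares
the status code a plain libcurl request receives with the one a
browser-like User-Agent receives:

## Not part of the package -- just a way to compare the status code a
## default libcurl request gets with the one a browser-like User-Agent gets.
library(curl)

status_for <- function(url, ua = NULL) {
  h <- new_handle(nobody = TRUE)            # HEAD-style request: headers only
  if (!is.null(ua)) handle_setopt(h, useragent = ua)
  curl_fetch_memory(url, handle = h)$status_code
}

u <- "https://guides.dss.gov.au/social-security-guide/3/4/1/10"
status_for(u)                                     # default libcurl User-Agent
status_for(u, "Mozilla/5.0 (X11; Linux x86_64)")  # browser-like User-Agent

If the second call returns 200 while the first returns 403 or 410,
the problem is the server's bot detection rather than the URL itself.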
I think it is a good thing that the check fails when a URL really has
gone or moved, and that behaviour should be preserved. I don't just
want to get past the check.
Hugh Parsonage.