[R-pkg-devel] Automated checking defeated by website anti-scraping rules
Iris Simmons
|kw@|mmo @end|ng |rom gm@||@com
Sat Jun 14 02:38:46 CEST 2025
I have a package that throws the same NOTE when checked; the CRAN
maintainers just let it pass every time. I wouldn't worry about it.
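
If you want to confirm it really is the anti-scraping rules rather than a
dead link, a quick sketch along these lines (using the curl package, and
assuming the block keys on the User-Agent header, which it may not; the
browser UA string below is made up) compares the status a plain request
gets with the status a browser-looking request gets:

library(curl)

url <- "https://guides.dss.gov.au/social-security-guide/3/4/1/10"

## default libcurl user agent, roughly what an automated checker sends
plain <- curl_fetch_memory(url)

## same request, but pretending to be a desktop browser
h <- new_handle(useragent = paste(
  "Mozilla/5.0 (X11; Linux x86_64; rv:127.0)",
  "Gecko/20100101 Firefox/127.0"))
browserish <- curl_fetch_memory(url, handle = h)

c(default_ua = plain$status_code, browser_ua = browserish$status_code)

If the second request comes back 200 while the first gets 403, the link is
fine and the server is simply refusing clients that don't look like a
browser. (If I remember right, urlchecker::url_check() will rerun the URL
checks locally, which at least makes it quick to see which NOTEs are
reproducible before submitting.)
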
On Fri, Jun 13, 2025, 20:35 Hugh Parsonage <hugh.parsonage using gmail.com> wrote:
> When checking a package on win-devel, I get the NOTE
>
> Found the following (possibly) invalid URLs:
> URL:
> http://classic.austlii.edu.au/au/legis/cth/consol_act/itaa1997240/s4.10.html
> From: man/small_business_tax_offset.Rd
> Status: 410
> Message: Gone
> URL: http://classic.austlii.edu.au/au/legis/cth/consol_act/mla1986131/
> From: man/medicare_levy.Rd
> Status: 410
> Message: Gone
> URL: https://guides.dss.gov.au/social-security-guide/3/4/1/10
> From: man/age_pension_age.Rd
> Status: 403
> Message: Forbidden
>
> The URLs exist (changing to https:// changes nothing) and are
> accessible from a browser just fine. They appear to have those HTTP
> statuses because of the servers' decision to block 'automated
> requests'. As imbecilic as these rules might be (they can probably be
> easily defeated), what should be the policy going forward? I can wrap
> these URLs in \code{} to get past the checks, but a better solution
> might be available at the check stage.
>
> I think the fact that a check fails when a URL really has failed or
> moved is a good thing and should be preserved. I don't just want to
> get past the check.
>
>
> Hugh Parsonage.
>