[R-pkg-devel] Scraping the CRAN website from a package

Maëlle SALMON maelle.salmon at yahoo.se
Mon Jul 19 09:14:18 CEST 2021


Could pkgsearch http://r-hub.github.io/pkgsearch/ help with what you're doing, as it can query all versions of CRAN packages? See http://r-hub.github.io/pkgsearch/reference/cran_package_history.html for the docs of the function cran_package_history().

It does not scrape CRAN pages; it uses an R-hub API (which gets the data from CRAN, so it is similar to your idea of building a separate DB :-) ).
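
A minimal sketch of that approach (assuming pkgsearch is installed; the
"Version" and "date" column names below are the ones I recall the function
returning, so treat them as illustrative):

    library(pkgsearch)

    # One row per released version of the package, including the parsed
    # DESCRIPTION fields.
    history <- cran_package_history("dplyr")

    # Release dates per version, e.g. to measure how long each release
    # stayed current.
    history[, c("Version", "date")]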

Maëlle.


On Saturday, 17 July 2021 02:21:55 CEST, <dbosak01 using gmail.com> wrote:

Maciej:

There are other packages that query the CRAN site (cranlogs, etc.), so it
seems such queries/fetches are generally allowed.  I can only find a couple
of relevant mentions in the CRAN policies:


"Packages which use Internet resources should fail gracefully with an
informative message if the resource is not available or has changed (and not
give a check warning nor error)."

"Downloads of additional software or data as part of package installation or
startup should only use secure download mechanisms (e.g., 'https' or
'ftps'). For downloads of more than a few MB, ensure that a sufficiently
large timeout is set."
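
A minimal sketch (not from pacs itself; the helper name and URL pattern are
just illustrative) of how a function could honour both points: use https,
set a generous timeout, and return NULL with a message rather than a
warning or error when CRAN cannot be reached:

    fetch_cran_page <- function(pkg) {
      url <- sprintf("https://CRAN.R-project.org/package=%s", pkg)
      # Generous download timeout for slow connections, restored on exit.
      old <- options(timeout = max(300, getOption("timeout")))
      on.exit(options(old), add = TRUE)
      tryCatch(
        suppressWarnings(readLines(url, warn = FALSE)),
        error = function(e) {
          # Fail gracefully: informative message, no warning, no error.
          message("CRAN is not available or the page has changed: ",
                  conditionMessage(e))
          NULL
        }
      )
    }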


So it seems like what you are trying to do would be OK with the appropriate
cautions in place.  Obviously, any test cases will have to run fast, or the
package will get rejected for being too slow to check.
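
For example (a testthat sketch, using the hypothetical fetch_cran_page()
helper from above), network-bound tests can simply be skipped on CRAN and
when offline:

    library(testthat)

    test_that("a CRAN package page can be fetched", {
      skip_on_cran()       # keep CRAN checks fast and offline-safe
      skip_if_offline()    # needs the curl package to detect connectivity
      page <- fetch_cran_page("dplyr")
      expect_true(is.null(page) || length(page) > 0)
    })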

Just my reading of the policies; I have never tried it myself.

David


-----Original Message-----
From: R-package-devel <r-package-devel-bounces using r-project.org> On Behalf Of
Maciej Nasinski
Sent: Friday, July 16, 2021 6:14 AM
To: r-package-devel using r-project.org
Subject: [R-pkg-devel] Scraping the CRAN website from a package

Dear Sir or Madam,

I am creating a new package `pacs` https://github.com/Polkas/pacs, which I
want to submit to CRAN shortly. However, I am not sure about the CRAN policy
regarding scraping each CRAN package page and its archive.
More precisely, I am fetching data from
https://CRAN.R-project.org/package=%s and
https://cran.r-project.org/src/contrib/Archive/%s/ (and downloading old
tar.gz files too).
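
To make that concrete, here is a rough sketch of the kind of fetch I mean
(not the actual pacs code; the package name and version are just examples):

    pkg <- "dplyr"
    ver <- "0.8.5"

    # The package landing page and one archived source tarball.
    page_url    <- sprintf("https://CRAN.R-project.org/package=%s", pkg)
    archive_url <- sprintf(
      "https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
      pkg, pkg, ver
    )

    # Download the archived release into a temporary file.
    tarball <- file.path(tempdir(), basename(archive_url))
    download.file(archive_url, tarball, mode = "wb", quiet = TRUE)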

Why I need this: I can read the DESCRIPTION file of any package at any point
in time and build a true dependency tree.  Moreover, I can get the lifetime
of any released package version, and versions that lived shorter than 7 days
are marked as risky.  I can also compare a package's minimum required
dependencies before we update it.  And much more.
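
Continuing the sketch above, the historical DESCRIPTION can then be
extracted from the downloaded tarball and read with read.dcf():

    # Extract only the DESCRIPTION file from the tarball downloaded above.
    exdir <- tempfile("pkg-archive-")
    untar(tarball, files = file.path(pkg, "DESCRIPTION"), exdir = exdir)

    # Dependency fields as they stood at that point in time.
    desc <- read.dcf(file.path(exdir, pkg, "DESCRIPTION"),
                     fields = c("Package", "Version", "Depends", "Imports"))
    desc[, "Imports"]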

I added a few notices inside the package, such as: "Please, as a courtesy to
CRAN, don't overload their server by constantly using this function."

Alternatively, if scraping CRAN from my package is a problem, I will try to
build a separate DB with such data (updated every day). Even then, the old
tar.gz files would still have to be downloaded.

Maciej Nasinski, University of Warsaw


______________________________________________
R-package-devel using r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel

