[R-pkg-devel] Retrieving versioned csv datasets for use in an R package

Fri Feb 14 16:08:01 CET 2025

Hi John,

There are several options for where to host the data (e.g. OSF, your own
server, GitHub). The solution I've adopted in most of my packages is to
combine our own server with GitHub: the data is first downloaded from our
server, and only if that server is offline is the download redirected to
GitHub. I do this so that our packages do not overload GitHub. Of course,
it creates some additional work on our side to make sure the files on our
server are always mirrored on GitHub.
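
As a rough sketch of that fallback logic (not our actual code: the URLs,
the function name download_with_fallback and its arguments are made up for
illustration, and it assumes both mirrors use a <base>/<version>/<file>
layout, which is what GitHub release assets look like when the tag equals
the version):

```r
# Hypothetical mirrors, tried in order: own server first, GitHub second.
mirrors <- c(
  "https://data.example.org/mypackage",
  "https://github.com/example-org/mypackage-data/releases/download"
)

# Try each mirror in turn and return the local path of the first
# successful download. `fetch` is injectable so the logic can be tested
# without network access.
download_with_fallback <- function(file, version, dest_dir = tempdir(),
                                   base_urls = mirrors,
                                   fetch = utils::download.file) {
  dest <- file.path(dest_dir, version, file)
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  for (base in base_urls) {
    url <- paste(base, version, file, sep = "/")
    ok <- tryCatch({
      status <- fetch(url, dest, mode = "wb", quiet = TRUE)
      # download.file() returns 0 on success
      identical(as.integer(status), 0L) && file.exists(dest)
    }, error = function(e) FALSE, warning = function(w) FALSE)
    if (ok) return(dest)
  }
  stop("Could not download '", file, "' from any mirror.")
}
```

A call such as download_with_fallback("inputs.csv", "2.1.0") would then only
reach GitHub when the first server fails ("inputs.csv" is a placeholder
file name).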

A key point when hosting the data on GitHub is to host it as an attachment
to a *release*. A good way to manage the files and releases is the
{piggyback} package, by Carl Boettiger et al. at rOpenSci. Its documentation
is a great guide on how to host data on GitHub, and the package has
convenient functions to create releases and to upload and download files.
Kudos to them!
https://docs.ropensci.org/piggyback/
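
For the versioned csv case, a minimal sketch with {piggyback} could look
like the following. The repository name, the helper names publish_version
and fetch_version, and the choice of one release tag per dataset version
are assumptions for illustration; only pb_release_create(), pb_upload() and
pb_download() come from the package. Uploading needs a GitHub token with
write access to the repository.

```r
# Hypothetical data repository.
data_repo <- "example-org/mypackage-data"

# Maintainer side: create a release tagged with the dataset version and
# attach the csv files to it.
publish_version <- function(files, version, repo = data_repo) {
  piggyback::pb_release_create(repo = repo, tag = version)
  piggyback::pb_upload(files, repo = repo, tag = version)
}

# User side: download all files of a release into a per-version cache
# directory, skipping the download if csv files are already there.
fetch_version <- function(version, repo = data_repo,
                          cache_dir = tools::R_user_dir("MyPackage", "cache")) {
  dest <- file.path(cache_dir, version)
  if (length(list.files(dest, pattern = "\\.csv$")) == 0) {
    dir.create(dest, recursive = TRUE, showWarnings = FALSE)
    piggyback::pb_download(dest = dest, repo = repo, tag = version)
  }
  dest  # directory to hand to the C++ code as its data folder
}
```

The returned directory is what the package would store internally as the
data_folder path for the requested version.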

Best,

Rafael Pereira

On Fri, Feb 14, 2025 at 11:55 AM John Clarke <john.clarke using cornerstonenw.com>
wrote:

> Hi folks,
>
> I've looked around for this particular question, but haven't found a good
> answer. I have a versioned dataset that includes about 6 csv files that
> total about 15MB for each version. The versions get updated every few years
> or so and are used to drive the model which was written in C++ but is now
> inside an Rcpp wrapper. Apart from the fact that CRAN does not permit large
> files, I want to have a better way for users to access particular versions
> of the dataset.
>
> Usage idea:
> # The following would hopefully also download the default/most recent
> # version of the csv files from CRAN (if allowed) or Github or some other
> # repository for academic open source data.
> install.packages("MyPackage")
> mypackage = new(MyPackage)
>
> Then, if necessary, the user could change the dataset used with something
> like:
> mypackage$dataset("2.1.0") which would retrieve new csv files if they
> haven't already been downloaded and update the data_folder path internally
> to point to 2.1.0 directory.
>
> Requirements:
> - The dataset is csv (not an R data object) and the Rcpp MyPackage expects
> this format
> - Would be nice to properly include citations for the data as they will
> likely be initially released through a journal publication
>
> What is the best practice for this sort of dataset management for a package
> in R? Is it okay to use Github to store and version the data? Or is it
> preferred to use an R package (ignoring the file size limit)? Or some other
> open source data hosting? I see https://r-universe.dev/ as an option as
> well. In any case, what is the proper mechanism for retrieving/caching the
> data?
>
> Thanks,
>
> -John
>
> John Clarke | Senior Technical Advisor |
> Cornerstone Systems Northwest | john.clarke using cornerstonenw.com
>
>
> ______________________________________________
> R-package-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>
