[R-pkg-devel] Retrieving versioned csv datasets for use in an R package

Fri Feb 14 15:54:39 CET 2025

Hi folks,

I've looked around for an answer to this particular question, but haven't
found a good one. I have a versioned dataset of about six csv files that
total about 15 MB per version. The versions get updated every few years
or so and are used to drive a model that was written in C++ and is now
wrapped with Rcpp. Apart from the fact that CRAN does not permit large
files, I want a better way for users to access particular versions
of the dataset.

Usage idea:
# The following would hopefully also download the default (most recent)
# version of the csv files from CRAN (if allowed), GitHub, or some other
# repository for academic open source data.
install.packages("MyPackage")
mypackage = new(MyPackage)

Then, if necessary, the user could change the dataset used with something
like:
mypackage$dataset("2.1.0")
which would retrieve the new csv files if they haven't already been
downloaded and update the data_folder path internally to point to the
2.1.0 directory.
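
To make the idea concrete, here is a minimal sketch of the
download-if-missing behaviour, not a finished implementation. The function
name dataset_dir, the base_url argument, the file names, and the
<base_url>/<version>/<file> layout are all assumptions made up for
illustration; tools::R_user_dir() (R >= 4.0.0) supplies a per-user cache
directory.

```r
# Hypothetical helper: returns the local folder holding the csv files for
# `version`, downloading any that are not cached yet.
# `files` and `base_url` are placeholders, not a real repository layout.
dataset_dir <- function(version,
                        files,
                        base_url,
                        cache_root = tools::R_user_dir("MyPackage", which = "cache"),
                        fetch = utils::download.file) {
  dest <- file.path(cache_root, version)
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)

  for (f in files) {
    target <- file.path(dest, f)
    if (!file.exists(target)) {
      # Assumed URL layout: <base_url>/<version>/<file>
      url <- paste(base_url, version, f, sep = "/")
      # Download to a temporary name first so an interrupted transfer
      # is never mistaken for a complete cached file.
      tmp <- paste0(target, ".part")
      fetch(url, tmp, mode = "wb")
      if (!file.rename(tmp, target)) {
        stop("Could not move downloaded file into the cache: ", target)
      }
    }
  }
  dest
}
```

The returned path is what I would hand to the Rcpp object as its
data_folder; a second call for the same version touches only the local
cache.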

Requirements:
- The dataset is csv (not an R data object) and the Rcpp MyPackage expects
this format
- Would be nice to properly include citations for the data as they will
likely be initially released through a journal publication
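
For the citation requirement, one possibility is to keep a bibentry per
dataset version inside the package. This is only a sketch: the function
name dataset_citation is hypothetical and every field value below is a
placeholder, not a real reference.

```r
# Hypothetical lookup: one bibentry per dataset version.
# All field values are placeholders, not a real publication.
dataset_citation <- function(version) {
  refs <- list(
    "2.1.0" = utils::bibentry(
      bibtype = "Article",
      author  = utils::person("First", "Author"),
      title   = "Placeholder title for dataset version 2.1.0",
      journal = "Placeholder Journal",
      year    = "2025"
    )
  )
  if (!version %in% names(refs)) {
    stop("No citation recorded for dataset version ", version)
  }
  refs[[version]]
}
```

Calling toBibtex(dataset_citation("2.1.0")) would then give users a BibTeX
entry for the version they actually ran.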

What is the best practice for this sort of dataset management for a package
in R? Is it okay to use GitHub to store and version the data? Or is it
preferred to use an R data package (ignoring the file size limit), or some
other open source data hosting? I see https://r-universe.dev/ as an option
as well. In any case, what is the proper mechanism for retrieving and
caching the data?

Thanks,

-John

John Clarke | Senior Technical Advisor |
Cornerstone Systems Northwest | john.clarke using cornerstonenw.com
