[R-pkg-devel] Retrieving versioned csv datasets for use in an R package
John Clarke
john.clarke using cornerstonenw.com
Fri Feb 14 16:28:19 CET 2025
Thanks so much, Rafael. I think piggyback is exactly what I was looking for.
I wonder whether it is possible, or good practice, to call it during the
install.packages('MyPackage') process, so that the data is available before
the tests run in the R CMD build GitHub Action and so that users get the
default (most recent) dataset downloaded alongside the package.
-John
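
One possible alternative to downloading at install time is to fetch the
files on first use and cache them per version. The sketch below is only an
illustration under stated assumptions: the helper names dataset_dir() and
fetch_release(), the repository "owner/MyPackage-data", and the convention
that each release tag equals the dataset version string are all
hypothetical. piggyback::pb_download() and tools::R_user_dir() (R >= 4.0)
are existing functions.

```r
# Hypothetical downloader: pulls every asset attached to one GitHub release.
# Assumes the release tag is the dataset version, e.g. "2.1.0".
fetch_release <- function(version, dest) {
  if (!requireNamespace("piggyback", quietly = TRUE)) {
    stop("Package 'piggyback' is needed to download the dataset.", call. = FALSE)
  }
  piggyback::pb_download(
    dest = dest,
    repo = "owner/MyPackage-data",  # hypothetical repository
    tag  = version
  )
}

# Hypothetical helper: returns the local folder for one dataset version,
# calling `fetch` only when no csv files are cached there yet.
dataset_dir <- function(version,
                        cache_root = tools::R_user_dir("MyPackage", which = "cache"),
                        fetch = fetch_release) {
  dest <- file.path(cache_root, version)
  if (length(list.files(dest, pattern = "\\.csv$")) == 0L) {
    dir.create(dest, recursive = TRUE, showWarnings = FALSE)
    fetch(version, dest)
  }
  dest
}
```

With something like this, a test setup file or a CI step could call
dataset_dir("2.1.0") once to populate the cache, and the returned path could
be handed to the Rcpp object as its data folder.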
On Fri, Feb 14, 2025 at 4:08 PM Rafael H. M. Pereira <
rafa.pereira.br using gmail.com> wrote:
> Hi John,
>
> There are different alternatives on where to host the data (e.g. OSF, a
> proprietary server, GitHub etc.). The solution I've been adopting in most
> of my packages is to use a combination of a proprietary server and GitHub:
> the data is first downloaded from our own server, and only if our server
> is offline is the download redirected to GitHub. I do this so that our
> packages do not overload GitHub. Of course, this creates some additional
> work on our side to make sure the files on our server are always mirrored
> on GitHub.
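
The server-first, GitHub-second order described above could be sketched
roughly as follows. This is a minimal illustration, not the code used in
those packages: the function name download_with_fallback() and the example
URLs are hypothetical, and only base R's utils::download.file() is assumed.

```r
# Try each base URL in order and stop at the first successful download.
# `downloader` is injectable so the logic can be exercised without a network.
download_with_fallback <- function(file, dest, base_urls,
                                   downloader = utils::download.file) {
  for (base in base_urls) {
    url <- paste0(base, "/", file)
    ok <- tryCatch(
      {
        status <- downloader(url, destfile = dest, mode = "wb", quiet = TRUE)
        identical(as.integer(status), 0L)  # download.file() returns 0 on success
      },
      error = function(e) FALSE,
      warning = function(w) FALSE
    )
    if (ok) {
      return(invisible(url))
    }
    unlink(dest)  # drop any partial file before trying the next source
  }
  stop("Could not download '", file, "' from any source.", call. = FALSE)
}

# Hypothetical usage: own server first, GitHub release asset second.
# download_with_fallback(
#   "inputs.csv", file.path(tempdir(), "inputs.csv"),
#   c("https://data.example.org/2.1.0",
#     "https://github.com/owner/MyPackage-data/releases/download/2.1.0")
# )
```
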
>
> A key point to pay attention to when hosting the data on GitHub is to host
> it as an attachment to a *release*. A good way to manage the files and
> releases is the {piggyback} package, by Carl Boettiger et al. at
> rOpenSci. The package documentation is a really great guide on how to
> host data on GitHub, and it has some really convenient functions to
> create releases and to upload and download files. Kudos to them!
> https://docs.ropensci.org/piggyback/
>
> Best,
>
> Rafael Pereira
>
> On Fri, Feb 14, 2025 at 11:55 AM John Clarke <
> john.clarke using cornerstonenw.com> wrote:
>
>> Hi folks,
>>
>> I've looked around for this particular question, but haven't found a good
>> answer. I have a versioned dataset that includes about 6 csv files that
>> total about 15MB for each version. The versions get updated every few
>> years or so and are used to drive the model, which was written in C++ but
>> is now inside an Rcpp wrapper. Apart from the fact that CRAN does not
>> permit large files, I want to have a better way for users to access
>> particular versions of the dataset.
>>
>> Usage idea:
>> # The following would hopefully also download the default/most recent
>> # version of the csv files from CRAN (if allowed), GitHub, or some other
>> # repository for academic open source data.
>> install.packages("MyPackage")
>> mypackage = new(MyPackage)
>>
>> Then, if necessary, the user could change the dataset used with something
>> like:
>> mypackage$dataset("2.1.0")
>> which would retrieve new csv files if they haven't already been
>> downloaded and update the data_folder path internally to point to the
>> 2.1.0 directory.
>>
>> Requirements:
>> - The dataset is csv (not an R data object) and the Rcpp MyPackage
>> expects this format
>> - It would be nice to properly include citations for the data, as they
>> will likely be initially released through a journal publication
>>
>> What is the best practice for this sort of dataset management for a
>> package in R? Is it okay to use GitHub to store and version the data? Or
>> is it preferred to use an R package (ignoring the file size limit)? Or
>> some other open source data hosting? I see https://r-universe.dev/ as an
>> option as well. In any case, what is the proper mechanism for
>> retrieving/caching the data?
>>
>> Thanks,
>>
>> -John
>>
>> John Clarke | Senior Technical Advisor |
>> Cornerstone Systems Northwest | john.clarke using cornerstonenw.com
>>
>> ______________________________________________
>> R-package-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>>
>