[R-pkg-devel] Retrieving versioned csv datasets for use in an R package

Fri Feb 14 18:02:23 CET 2025

Seconded. Let the user initiate retrieval of the desired file entirely, and pass the filename explicitly into the functions that use the data. That also makes it easier to trace which file was used in a past analysis. Automatic configuration seems convenient, but it makes it hard to record which inputs were used. You can make the function(s) that retrieve/cache the data as simple as you like, but please no simpler than specifying the data version somewhere in every script that uses the data.
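
For what it's worth, here is a rough sketch of what I mean, not a finished design. Everything in it is made up for illustration: the function names dataset_dir() and fetch_dataset(), the base_url and files arguments, and the assumed layout of <data_root>/<version>/*.csv. None of it comes from the package in question. The point is only that the version and the location are spelled out by the caller, and that a purely local directory works without any network access.

```r
# Hypothetical helper: return the directory holding one dataset version.
# `data_root` is a user-chosen local directory, so this also works offline.
dataset_dir <- function(version, data_root) {
  stopifnot(is.character(version), length(version) == 1L,
            is.character(data_root), length(data_root) == 1L)
  dir <- file.path(data_root, version)
  if (!dir.exists(dir)) {
    stop("Dataset version '", version, "' not found under '", data_root,
         "'. Download or copy it there first.", call. = FALSE)
  }
  csv <- list.files(dir, pattern = "\\.csv$")
  if (length(csv) == 0L) {
    stop("No csv files found in '", dir, "'.", call. = FALSE)
  }
  dir
}

# Hypothetical retrieval step, run only when the user explicitly asks for it.
# `base_url` and `files` are assumptions; files already present are not
# downloaded again, which is all the caching this sketch does.
fetch_dataset <- function(version, data_root, base_url, files) {
  dir <- file.path(data_root, version)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  for (f in files) {
    dest <- file.path(dir, f)
    if (!file.exists(dest)) {
      download.file(paste(base_url, version, f, sep = "/"), dest, mode = "wb")
    }
  }
  invisible(dir)
}

# In an analysis script the inputs are then recorded in plain sight, e.g.:
#   data_dir <- dataset_dir("2.1.0", data_root = "~/mymodel-data")
#   run_model(data_dir)   # run_model() is a placeholder for the model call
```

With something like this, a user behind a firewall can copy the csv files into the versioned directory by hand and never call fetch_dataset() at all.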

On February 14, 2025 8:10:58 AM PST, Jan van der Laan <rhelp using eoos.dds.nl> wrote:
>
>Not an answer, but a request from someone often working behind firewalls and/or machines not connected to the internet. Please have a way to have the package search for the data at some user specified location such as a local directory.
>
>Best,
>
>Jan
>
>
>
>On 14-02-2025 15:54, John Clarke wrote:
>> Hi folks,
>> 
>> I've looked around for this particular question, but haven't found a good
>> answer. I have a versioned dataset that includes about 6 csv files that
>> total about 15MB for each version. The versions get updated every few years
>> or so and are used to drive the model which was written in C++ but is now
>> inside an Rcpp wrapper. Apart from the fact that CRAN does not permit large
>> files, I want to have a better way for users to access particular versions
>> of the dataset.
>> 
>> Usage idea:
>> # The following would hopefully also download default/most recent version
>> # of the csv files from CRAN (if allowed) or Github or some other repository
>> # for academic open source data.
>> install.packages("MyPackage")
>> mypackage = new(MyPackage)
>> 
>> Then, if necessary, the user could change the dataset used with something
>> like:
>> mypackage$dataset("2.1.0") which would retrieve new csv files if they
>> haven't already been downloaded and update the data_folder path internally
>> to point to 2.1.0 directory.
>> 
>> Requirements:
>> - The dataset is csv (not a R data object) and the Rcpp MyPackage expects
>> this format
>> - Would be nice to properly include citations for the data as they will
>> likely be initially released through a journal publication
>> 
>> What is the best practice for this sort of dataset management for a package
>> in R? Is it okay to use Github to store and version the data? Or
>> preferred to use an R package (ignoring the file size limit). Or some other
>> open source data hosting? I see https://r-universe.dev/ as an option as
>> well. In any case, what is the proper mechanism for retrieving/caching the
>> data?
>> 
>> Thanks,
>> 
>> -John
>> 
>> John Clarke | Senior Technical Advisor |
>> Cornerstone Systems Northwest | john.clarke using cornerstonenw.com
>> 
>> 
>> ______________________________________________
>> R-package-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-package-devel
