[R-pkg-devel] Retrieving versioned csv datasets for use in an R package

Simon Urbanek simon.urbanek using R-project.org
Sat Feb 15 07:50:28 CET 2025


I would like to second the Zenodo recommendation. GitHub is not reliable enough for reproducible research: your files can disappear at any point or change without notice, which is why Zenodo was created. This approach assumes that your package ships the list of DOIs it can offer, but ideally that should be the case anyway, because you don't want the data to change after your package has been published, again for the sake of reproducibility.
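
A minimal sketch of what that could look like in R, assuming the package keeps a version-to-DOI table and caches each version in its own folder. Everything named here is hypothetical: the DOIs are placeholders, and dataset_dois, dataset_path() and the fetch callback are illustrative names, not part of any existing package.

```r
# Hypothetical table mapping each dataset version to its version-specific
# Zenodo DOI. These DOIs are placeholders, not real records.
dataset_dois <- c(
  "2.0.0" = "10.5281/zenodo.0000001",
  "2.1.0" = "10.5281/zenodo.0000002"
)

# Return the folder holding the csv files for `version`, downloading them
# only when no csv file is present in that folder yet.
# `fetch` is an assumed callback, function(doi, dest_dir), that places all
# files of the record in dest_dir. tools::R_user_dir() needs R >= 4.0.0.
dataset_path <- function(version, fetch,
                         cache_dir = tools::R_user_dir("MyPackage",
                                                       which = "cache")) {
  if (!version %in% names(dataset_dois)) {
    stop("Unknown dataset version: ", version)
  }
  dest <- file.path(cache_dir, version)
  # list.files() returns character(0) for a folder that does not exist
  if (length(list.files(dest, pattern = "\\.csv$")) == 0L) {
    dir.create(dest, recursive = TRUE, showWarnings = FALSE)
    fetch(dataset_dois[[version]], dest)
  }
  dest
}
```

Here fetch could be a thin wrapper around zen4R, as Thierry describes, and the returned folder is what the C++ code would be pointed at. Keeping the download behind a callback also lets the lookup and caching be exercised offline.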

Cheers,
Simon


> On Feb 15, 2025, at 7:58 AM, Thierry Onkelinx <thierry.onkelinx using inbo.be> wrote:
> 
> Dear John,
> 
> Our approach to an open and reproducible workflow is to publish the data
> via Zenodo. https://zenodo.org/ is maintained by CERN.
> - The data is freely available.
> - Your data is easy to cite.
> - Every version gets its own DOI, and there is one stable DOI that always
> points to the most recent version. E.g. https://doi.org/10.5281/zenodo.14179531
> 
> The zen4R package makes it easy to upload and download the data from within
> R. Our functions assume the data is in a local folder. Only when the data
> is missing do we try to download it from Zenodo.
> 
> Best regards,
> 
> ir. Thierry Onkelinx
> Statisticus / Statistician
> 
> Vlaamse Overheid / Government of Flanders
> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
> FOREST
> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
> thierry.onkelinx using inbo.be
> 
> On Fri 14 Feb 2025 at 15:55, John Clarke <
> john.clarke using cornerstonenw.com> wrote:
> 
>> Hi folks,
>> 
>> I've looked around for this particular question, but haven't found a good
>> answer. I have a versioned dataset that includes about 6 csv files that
>> total about 15MB for each version. The versions get updated every few years
>> or so and are used to drive the model which was written in C++ but is now
>> inside an Rcpp wrapper. Apart from the fact that CRAN does not permit large
>> files, I want to have a better way for users to access particular versions
>> of the dataset.
>> 
>> Usage idea:
>> # The following would hopefully also download the default/most recent
>> # version of the csv files from CRAN (if allowed), GitHub, or some other
>> # repository for academic open source data.
>> install.packages("MyPackage")
>> mypackage = new(MyPackage)
>> 
>> Then, if necessary, the user could change the dataset used with something
>> like:
>> mypackage$dataset("2.1.0")
>> which would retrieve the new csv files if they haven't already been
>> downloaded and update the data_folder path internally to point to the
>> 2.1.0 directory.
>> 
>> Requirements:
>> - The dataset is csv (not an R data object) and the Rcpp MyPackage expects
>> this format
>> - It would be nice to properly include citations for the data, as they will
>> likely be initially released through a journal publication
>> 
>> What is the best practice for this sort of dataset management for a package
>> in R? Is it okay to use GitHub to store and version the data? Or is it
>> preferable to use an R package (ignoring the file size limit)? Or some other
>> open source data hosting? I see https://r-universe.dev/ as an option as
>> well. In any case, what is the proper mechanism for retrieving/caching the
>> data?
>> 
>> Thanks,
>> 
>> -John
>> 
>> John Clarke | Senior Technical Advisor |
>> Cornerstone Systems Northwest | john.clarke using cornerstonenw.com
>> 
>>        [[alternative HTML version deleted]]
>> 
>> ______________________________________________
>> R-package-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>> 


