[R-pkg-devel] Retrieving versioned csv datasets for use in an R package

Sean Davis @e@nd@v| @end|ng |rom gm@||@com
Mon Mar 31 13:24:27 CEST 2025


Hi, all.

Zenodo does offer storage (I believe limited to 50GB per submission). It is backed by CERN, with storage guaranteed for at least 20 years (tied to the lifetime of CERN, and it could be extended).

I agree that GitHub is a viable alternative to Zenodo, and the two can easily be used together. On GitHub, there are three different approaches: 1) check files into version control, 2) use Git LFS, and 3) attach files as release artifacts. Each has pros and cons.
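
As a rough sketch of how a package might retrieve a versioned CSV from a GitHub release (or a Zenodo record) while guarding against files changing without notice: pin the URL to a specific version and check a known checksum before using the file. The function name fetch_versioned_csv, the package name "mypkg", and the cache layout are all hypothetical, and tools::R_user_dir() assumes R >= 4.0.0. This is only one way to do it, not a recommendation from any of the hosts mentioned.

```r
# Hypothetical helper: download a CSV once, verify it against a known MD5
# checksum, and reuse the cached copy on later calls.
# "mypkg" is a placeholder package name.
fetch_versioned_csv <- function(url, md5,
                                cache_dir = tools::R_user_dir("mypkg", "cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

  # Include the checksum in the file name so different versions never collide.
  dest <- file.path(cache_dir, paste0(md5, "-", basename(url)))

  if (!file.exists(dest)) {
    # Download to a temporary file first so a failed or corrupted download
    # never ends up in the cache under the final name.
    tmp <- tempfile(tmpdir = cache_dir)
    on.exit(unlink(tmp), add = TRUE)
    utils::download.file(url, tmp, mode = "wb", quiet = TRUE)

    actual <- unname(tools::md5sum(tmp))
    if (!identical(actual, md5)) {
      stop("checksum mismatch for ", url,
           ": expected ", md5, ", got ", actual)
    }
    file.rename(tmp, dest)
  }

  utils::read.csv(dest)
}
```

A call might look like fetch_versioned_csv("https://github.com/<owner>/<repo>/releases/download/<tag>/data.csv", md5 = "<md5 of that file>"), with the placeholders filled in for the actual release.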

If data are of a biomedical nature, one can often deposit them in a biomedical repository, including one of dozens at NIH and EBI. NIH even recommends some open “generalist repositories” that include Zenodo and OSF: https://www.nlm.nih.gov/NIHbmic/generalist_repositories.html

If one is looking to cater to the machine learning/AI community, hosting on Hugging Face is another option. Doing so is quite similar to hosting on GitHub from a purely practical perspective.

Cloud storage systems such as AWS, GCP, and Azure are possibilities, but egress charges can be challenging to predict. Cloudflare R2 is S3-compatible and has no egress charges, making it a good choice for sharing particularly large files.

On the client side, Bioconductor has BiocFileCache, a package for caching downloaded files. Other file download/cache packages are available, though I’m less familiar with them.
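
For illustration, a minimal sketch of wrapping BiocFileCache inside a package might look like the following. The wrapper name cached_path and the package name "mypkg" are hypothetical. It assumes BiocFileCache is installed from Bioconductor and that R >= 4.0.0 is available for tools::R_user_dir().

```r
# Hypothetical wrapper around BiocFileCache; "mypkg" is a placeholder name.
# Requires the Bioconductor package BiocFileCache to be installed.
cached_path <- function(url,
                        cache_dir = tools::R_user_dir("mypkg", "cache")) {
  # ask = FALSE creates the cache directory without prompting the user.
  bfc <- BiocFileCache::BiocFileCache(cache_dir, ask = FALSE)

  # bfcrpath() returns the local path for the resource; if the resource is
  # not yet in the cache it is added (and downloaded) first.
  unname(BiocFileCache::bfcrpath(bfc, url))
}
```

The returned path can then be passed to read.csv() or any other reader. Later calls with the same URL reuse the cached copy rather than downloading it again.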

Just wanted to expand the list a bit.

Sean


From: R-package-devel <r-package-devel-bounces using r-project.org> on behalf of Dirk Eddelbuettel <edd using debian.org>
Date: Saturday, February 15, 2025 at 10:29 AM
To: Simon Urbanek <simon.urbanek using R-project.org>
Cc: R-package-devel using r-project.org <R-package-devel using r-project.org>
Subject: Re: [R-pkg-devel] Retrieving versioned csv datasets for use in an R package

On 15 February 2025 at 19:50, Simon Urbanek wrote:
| Github is not reliable enough for reproducible research (your files can
| disappear at any point - or can change without notice),

I'm curious: Do you have a concrete example of a no-longer-reproducible study
whose data or other support files changed and thereby caused this breakage?

| that's why Zenodo was created.

But AFAIK Zenodo offers DOI issuance only, not storage (as, say, OSF would).
So this does not address the problem faced by the OP.

Dirk

--
dirk.eddelbuettel.com | @eddelbuettel | edd using debian.org

______________________________________________
R-package-devel using r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
