[R-pkg-devel] Best practices for distributing large data files

Blätte, Andreas andreas.blaette at uni-due.de
Wed Feb 16 10:24:11 CET 2022


Dear Rafa, 

AWS is a good option, and we are very satisfied with Zenodo for data that can be made publicly available.

See the corpus_install() function in the 'cwbtools' package I maintain (https://github.com/PolMine/cwbtools/blob/master/R/corpus.R). It offers download options from both AWS and Zenodo, e.g. to download and install the (~1 GB) GermaParl corpus: https://doi.org/10.5281/zenodo.3742113
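For illustration, the user-facing call is roughly the following (a minimal sketch; I assume the `doi` argument of corpus_install() as in current cwbtools releases, see ?corpus_install for the authoritative interface):

    # install cwbtools from CRAN, then pull GermaParl from Zenodo;
    # corpus_install() resolves the DOI, downloads the tarball and
    # registers the corpus locally
    install.packages("cwbtools")
    cwbtools::corpus_install(doi = "10.5281/zenodo.3742113")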

Zenodo is easy to use and has the great advantage that a DOI is assigned automatically. AWS is our option for restricted data. Yet managing access rights appropriately with AWS is not easy; users often need assistance to create credential files.

Kind regards
Andreas 




On 16.02.22, 03:55, "R-package-devel on behalf of Ayala Hernandez, Rafael" <r-package-devel-bounces using r-project.org on behalf of r.ayala14 using imperial.ac.uk> wrote:

    Dear all,

    I am currently trying to think of the best way to distribute large sets of coefficients required by my package asteRisk.

    At the moment, I am using an accessory data package, asteRiskData, available from a drat repository, that bundles all of the required coefficients already parsed and stored as R objects.
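    Concretely, the pattern looks roughly like this (the drat URL below is a placeholder for wherever the repository is published with drat::insertPackage()):

        # one-off installation of the accessory data package from the drat repo
        install.packages("asteRiskData",
                         repos = "https://<github-user>.github.io/drat/")

        # typical guard in the main package: a soft check that the data package is available
        hasData <- function() requireNamespace("asteRiskData", quietly = TRUE)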

    However, as my package grows, the amount of data required is also growing. This has made the size of asteRiskData grow larger, reaching 99.99 MB at the moment, which is right at GitHub's 100 MB per-file limit. Since the source package must be uploaded as a single .tar.gz file for the drat repository, I see no easy workaround other than splitting it into multiple accessory data packages.

    I believe this option could become rather troublesome in the future, if the number of accessory data packages starts to grow too much.

    So I would like to ask, is there any recommended procedure for distributing such large data files? 

    Another option that has been suggested to me is not to use an accessory data package at all, but instead to download and parse the required data on demand from the corresponding internet resources, store the results locally, and have future sessions load the local copies, so that download and parsing happen only once (or only occasionally, when the associated resource is updated). However, this would leave files of relatively large size (several tens of MB) scattered in the users' local environments, instead of having them all centralized in the accessory data package. Is this option acceptable as well?
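    A rough sketch of what that on-demand caching could look like, using tools::R_user_dir() (R >= 4.0.0) so the local copies at least live in one well-defined directory; parse_coefficients() is a hypothetical placeholder for the parsing step:

        # download, parse and cache one coefficient set; later calls read the cached copy
        get_coefficients <- function(name, url) {
            cache_dir <- tools::R_user_dir("asteRisk", which = "cache")
            dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
            target <- file.path(cache_dir, paste0(name, ".rds"))
            if (!file.exists(target)) {
                raw <- tempfile()
                utils::download.file(url, raw, mode = "wb")
                saveRDS(parse_coefficients(raw), target)  # hypothetical parser
            }
            readRDS(target)
        }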

    Thanks a lot in advance for any insights

    Best wishes,

    Rafa
    ______________________________________________
    R-package-devel using r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-package-devel


