[R-pkg-devel] Best practices for distributing large data files

Ayala Hernandez, Rafael r.ayala14 sending from imperial.ac.uk
Wed Feb 16 03:55:01 CET 2022


Dear all,

I am currently trying to think of the best way to distribute large sets of coefficients required by my package asteRisk.

At the moment, I am using an accessory data package, asteRiskData, available from a drat repository, which bundles all of the required coefficients already parsed and stored as R objects.
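For reference, a minimal sketch of the kind of setup I mean, with asteRiskData listed under Suggests and a runtime check that points users to the drat repository (the function name and repository URL below are just placeholders):

hasData <- function() {
    # Check for the accessory data package without importing it unconditionally
    if (!requireNamespace("asteRiskData", quietly = TRUE)) {
        message("The asteRiskData package is required for this functionality.\n",
                "Please install it with:\n",
                "install.packages('asteRiskData', repos = 'https://<drat-repo-URL>')")
        return(invisible(FALSE))
    }
    invisible(TRUE)
}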

However, as my package grows, the amount of required data is also growing. This has pushed the size of asteRiskData to 99.99 MB at the moment, which is at the limit of what can be uploaded to GitHub. Since the source package must be uploaded as a single .tar.gz file to the drat repository, I see no easy workaround other than splitting it into multiple accessory data packages.

I believe this option could become rather troublesome in the future if the number of accessory data packages grows too large.

So I would like to ask: is there any recommended procedure for distributing such large data files?

Another option that has been suggested to me is not to use an accessory data package at all, but instead to download and parse the required data on demand from the corresponding internet resources, store them locally, and have future sessions load them from the local copies. In that way, downloading and parsing would be needed only once (or perhaps occasionally, whenever the associated resource is updated), rather than in every R session. However, this would leave relatively large files (several tens of MB) scattered in the local environments of users, instead of having them all centralized in the accessory data package. Is this option acceptable as well?
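Concretely, what I have in mind for that second option is something like the following sketch, assuming R >= 4.0 for tools::R_user_dir(); the function name, URL argument, file name and parser are all placeholders:

getCoefficientData <- function(url) {
    # Per-user cache directory, created on first use
    cacheDir <- tools::R_user_dir("asteRisk", which = "cache")
    dir.create(cacheDir, recursive = TRUE, showWarnings = FALSE)
    cachedFile <- file.path(cacheDir, "coefficients.rds")
    if (!file.exists(cachedFile)) {
        # Download the raw data and parse it only once
        rawFile <- tempfile()
        download.file(url, destfile = rawFile, mode = "wb")
        coefficients <- read.table(rawFile)  # placeholder for the real parser
        saveRDS(coefficients, cachedFile)
    }
    readRDS(cachedFile)
}

Using tools::R_user_dir() would at least keep the downloaded files under a single, documented per-user location that can be cleared, rather than truly scattering them around, but I am not sure whether this pattern is considered acceptable practice.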

Thanks a lot in advance for any insights.

Best wishes,

Rafa
