[R-pkg-devel] How to store large data to be used in an R package?

Mon Mar 25 11:45:06 CET 2024

On Mon, 25 Mar 2024 11:12:57 +0100
Jairo Hidalgo Migueles <jairo.hidalgo.migueles using gmail.com> wrote:

> Specifically, this data consists of regression and random forest
> models crucial for making predictions within our R package.

Apologies for asking a silly question, but is there a chance that these
models are large by accident (e.g. because an object references a large
environment containing multiple copies of the training dataset)? Or
are there really more than a million weights required to make
predictions?
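
One way to check for this is to compare what object.size() reports with
the number of bytes the object actually serializes to. The following is
only a sketch: fit_model() and junk are made-up names, and mtcars stands
in for the real training data. It shows a formula created inside a
function carrying that function's whole frame along with the model.

```r
# Hypothetical helper: the formula is created inside the function, so its
# environment is the function's frame, which also holds `junk`.
fit_model <- function(data) {
  junk <- rnorm(1e6)  # stand-in for a large temporary object (about 8 MB)
  lm(mpg ~ wt, data = data)
}

model <- fit_model(mtcars)

# object.size() does not follow environments, so it can understate what
# save() or saveRDS() will write.
print(object.size(model), units = "MB")

# serialize() does follow them, so this is closer to the size on disk
# before compression.
serialized_size <- length(serialize(model, NULL))
serialized_size / 1024^2

# List what rides along in the environment attached to the terms object.
env <- attr(model$terms, ".Environment")
ls(env)
```

If the two sizes differ a lot, the extra bytes most likely come from a
captured environment. Whether that environment can be replaced safely
depends on the model class and on what its predict() method looks up at
prediction time, so test predictions after any such change.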

> Initially, I attempted to save these models as internal data within
> the package. While this approach maintains functionality, it has led
> to a package size exceeding 20 MB. I'm concerned that this would
> complicate submitting the package to CRAN in the future.

The CRAN Repository Policy mentions the possibility of having a separate large
data-only package. Since CRAN strives to archive all package versions,
this data-only package will have to be updated as rarely as possible.
You will need to ask CRAN for approval.

If there is a significant amount of core functionality inside your
package that does *not* require the large data (so that it can still
be installed and used without the data), you can publish the data-only
package yourself (e.g. using the 'drat' package), put it in Suggests
and link to it in the Additional_repositories field of your DESCRIPTION.
Alternatively, you can publish the data on Zenodo and offer to download
it on first use. Make sure to (1) use tools::R_user_dir to determine
where to put the files, (2) only download the files after the user
explicitly agrees to it and (3) test as much of your package
functionality as possible without requiring the data to be downloaded.
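
The download-on-first-use approach could look roughly like the sketch
below. Everything specific in it is an assumption made for illustration:
the package name "mypkg", the URL, the file name, and the helper names
model_cache_path() and get_models(). Using the "cache" directory rather
than "data" is also a judgement call. The tools::R_user_dir() function
requires R >= 4.0.0.

```r
# Hypothetical package name and file name; adjust to the real package.
model_cache_path <- function(pkg = "mypkg", file = "models.rds") {
  file.path(tools::R_user_dir(pkg, which = "cache"), file)
}

# Return the models, downloading them first only if the user agrees.
get_models <- function(url = "https://example.org/models.rds",  # placeholder URL
                       path = model_cache_path(),
                       ask = interactive()) {
  if (!file.exists(path)) {
    if (!ask) {
      # Never download silently, e.g. during R CMD check or in batch jobs.
      stop("Model file not found at ", path,
           "; call get_models() in an interactive session to download it.")
    }
    answer <- utils::askYesNo(sprintf("Download the model file to '%s'?", path))
    if (!isTRUE(answer)) stop("Download declined; models are not available.")

    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    # Download to a temporary file first so that an interrupted transfer
    # does not leave a truncated file at the final location.
    tmp <- tempfile(tmpdir = dirname(path))
    on.exit(unlink(tmp), add = TRUE)
    status <- utils::download.file(url, tmp, mode = "wb")
    if (status != 0) stop("Download failed.")
    file.rename(tmp, path)
  }
  readRDS(path)
}
```

Because the path is an argument, tests can point get_models() at a small
fixture file saved with saveRDS() and never touch the network, which
helps with point (3).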

-- 
Best regards,
Ivan