[R-pkg-devel] Large Data Package CRAN Preferences

Henrik Bengtsson henrik.bengtsson at gmail.com
Sun Dec 15 18:57:36 CET 2019


The R.cache package on CRAN can be used for this purpose. It works on
all platforms.  Per the CRAN policies, it prompts the user (in an
interactive session) whether they wish to use a persistent cache
folder or to fall back to a temporary one.  For example,

> path <- R.cache::getCachePath(dirs = "MyDataPkg")
The R.cache package needs to create a directory that will hold cache
files. It is convenient to use  '/home/hb/.cache/R/R.cache' because it
follows the standard on your operating system and it remains also
after restarting R. Do you wish to create the
'/home/hb/.cache/R/R.cache' directory? If not, a temporary directory
(/tmp/hb/RtmpvEgWIr/.Rcache) that is specific to this R session will
be used. [Y/n]:
> path
[1] "/home/hb/.cache/R/R.cache/MyDataPkg"
>
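Once a cache path is available, a common pattern is to memoize an expensive step with loadCache()/saveCache(). A minimal sketch, guarded so it is skipped when R.cache is not installed; the key list, the "MyDataPkg" dirs name, and the stand-in computation are illustrative, not from an actual package:

```r
if (requireNamespace("R.cache", quietly = TRUE)) {
  # A key uniquely identifying this cached result.
  key <- list(what = "combined-data", version = 1L)
  # loadCache() returns NULL on a cache miss.
  data <- R.cache::loadCache(key = key, dirs = "MyDataPkg")
  if (is.null(data)) {
    # Stand-in for the expensive step, e.g. combining many .RDS files.
    data <- data.frame(x = 1:3)
    R.cache::saveCache(data, key = key, dirs = "MyDataPkg")
  }
}
```

On the first run the computation executes and the result is written to the cache; on later runs in the same (or, with a persistent cache folder, a later) session, loadCache() returns the stored object immediately.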

Once the user has accepted this, the folder is created and will be
available in all future R sessions. That is, the next time they start
R there will be no prompt:

> path <- R.cache::getCachePath(dirs = "MyDataPkg")
> path
[1] "/home/hb/.cache/R/R.cache/MyDataPkg"

This is also the case in non-interactive sessions.  If the folder does
not exist in a non-interactive session, a temporary folder is used
instead (effectively making the cache lifetime equal to the session
lifetime).   'R CMD check' always uses a temporary cache, so that
there is no memory between checks.
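For comparison (not part of R.cache): base R, as of version 4.0.0, provides tools::R_user_dir(), which returns the platform-appropriate per-user cache directory for a given package. A short sketch; the "datapkg" package name is hypothetical:

```r
# Base R (>= 4.0.0) way to locate a per-user cache directory;
# "datapkg" is a hypothetical package name.
cache_dir <- tools::R_user_dir("datapkg", which = "cache")
# Per CRAN policy, a package should still ask the user (interactively)
# before creating a persistent directory; created unconditionally here
# only for illustration.
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
```

This addresses the "default cache location" question below without a hand-rolled option, at the cost of requiring R >= 4.0.0.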

/Henrik
(disclaimer: I'm the author)

On Sun, Dec 15, 2019 at 9:27 AM <bill at denney.ws> wrote:
>
> Hi Uwe,
>
> Thanks for this information, and it makes sense to me.  Is there a preferred way to cache the data locally?
>
> None of the ways that I can think to cache the data sound particularly good, and I wonder if I'm missing something.  The ideas that occur to me are:
>
> 1. Download them into the package directory `path.package("datapkg")`, but that would require an action to be performed on package installation, and I'm unaware of any way to trigger an action on installation.
> 2. Have a user-specified cache directory (e.g. `options("datapkg_cache"="/my/cache/location")`), but that would require interaction with every use.  (Not horrible, but it will likely significantly increase the number of user issues with the package.)
> 3. Have a user-specified cache directory like #2, but have it default to somewhere in their home directory like `file.path(Sys.getenv("HOME"), "datapkg_cache")` if they have not set the option.
>
> To me #3 sounds best, but I'd like to be sure that I'm not missing something.
>
> Thanks,
>
> Bill
>
> -----Original Message-----
> From: Uwe Ligges <ligges at statistik.tu-dortmund.de>
> Sent: Sunday, December 15, 2019 11:54 AM
> To: bill at denney.ws; r-package-devel at r-project.org
> Subject: Re: [R-pkg-devel] Large Data Package CRAN Preferences
>
> Ideally you would host the data elsewhere and submit a CRAN package that allows users to easily get/merge/aggregate the data.
>
> Best,
> Uwe Ligges
>
>
>
> On 12.12.2019 20:55, bill at denney.ws wrote:
> > Hello,
> >
> >
> >
> > I have two questions about creating data packages for data that will
> > be updated and in total are >5 MB in size.
> >
> >
> >
> > The first question is:
> >
> >
> >
> > In the CRAN policy, it indicates that packages should be ≤5 MB in size
> > in general.  Within a package that I'm working on, I need access to
> > data that are updated approximately quarterly, including the
> > historical datasets (specifically, these are the SDTM and CDASH
> > terminologies in https://evs.nci.nih.gov/ftp1/CDISC/SDTM/Archive/).
> >
> >
> >
> > Current individual data updates are approximately 1 MB when
> > individually saved as .RDS, and the total current set is about 20 MB.
> > I think that the preferred way to generate these packages since there
> > will be future updates is to generate one data package for each update
> > and then have an umbrella package that will depend on each of the individual data update packages.
> > That seems like it will minimize space requirements on CRAN since old
> > data will probably never need to be updated (though I will need to access it).
> >
> >
> >
> > Is that an accurate summary of the best practice for creating these as
> > a data package?
> >
> >
> >
> > And a second question is:
> >
> >
> >
> > Assuming the best practice is the one I described above, the typical
> > need will be to combine the individual historical datasets for local
> > use.  An initial test of the time to combine the data indicates that
> > it would take about 1 minute to do, but after combination, the result
> > could be loaded faster.  I'd like to store the combined dataset
> > locally with the umbrella package.  I believe that it is considered
> > poor form to write within the library location for a package except during installation.
> >
> >
> >
> > What is the best practice for caching the resulting large dataset
> > which is locally-generated?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Bill
> >
> >
> > ______________________________________________
> > R-package-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-package-devel
> >
>
