[R-pkg-devel] Large Data Package CRAN Preferences
bill using denney.ws
Sun Dec 15 18:27:09 CET 2019
Hi Uwe,
Thanks for this information, and it makes sense to me. Is there a preferred way to cache the data locally?
None of the ways that I can think to cache the data sound particularly good, and I wonder if I'm missing something. The ideas that occur to me are:
1. Download them into the package directory `path.package("datapkg")`, but that would require an action to be performed when the package is installed, and I'm unaware of any way to trigger an action at installation time.
2. Have a user-specified cache directory (e.g. `options("datapkg_cache"="/my/cache/location")`), but that would require setup by every user. (Not horrible, but it would likely significantly increase the number of user issues with the package.)
3. Have a user-specified cache directory like #2, but have it default to somewhere in their home directory like `file.path(Sys.getenv("HOME"), "datapkg_cache")` if they have not set the option.
To me #3 sounds best, but I'd like to be sure that I'm not missing something.
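For concreteness, here is a minimal sketch of what I have in mind for #3 (the `datapkg_cache` option name and the `combine_all_terminologies()` helper are placeholders, not actual code from the package):

    # Resolve the cache directory: use the user-set option if present,
    # otherwise fall back to a default under the user's home directory.
    datapkg_cache_dir <- function() {
      dir <- getOption("datapkg_cache",
                       default = file.path(Sys.getenv("HOME"), "datapkg_cache"))
      if (!dir.exists(dir)) {
        dir.create(dir, recursive = TRUE)
      }
      dir
    }

    # Load the combined dataset from the cache if it exists; otherwise
    # build it (the slow step) and save it for next time.
    # combine_all_terminologies() is a placeholder for the merge code.
    load_combined <- function() {
      cache_file <- file.path(datapkg_cache_dir(), "combined.rds")
      if (file.exists(cache_file)) {
        readRDS(cache_file)
      } else {
        combined <- combine_all_terminologies()
        saveRDS(combined, cache_file)
        combined
      }
    }

With something like that, the option would only matter to users who want a non-default cache location.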
Thanks,
Bill
-----Original Message-----
From: Uwe Ligges <ligges using statistik.tu-dortmund.de>
Sent: Sunday, December 15, 2019 11:54 AM
To: bill using denney.ws; r-package-devel using r-project.org
Subject: Re: [R-pkg-devel] Large Data Package CRAN Preferences
Ideally you would host the data elsewhere and submit a CRAN package that allows users to easily get/merge/aggregate the data.
Best,
Uwe Ligges
On 12.12.2019 20:55, bill using denney.ws wrote:
> Hello,
>
> I have two questions about creating data packages for data that will
> be updated and in total are >5 MB in size.
>
> The first question is:
>
> The CRAN policy indicates that packages should in general be ≤5 MB in
> size.  Within a package that I'm working on, I need access to
> data that are updated approximately quarterly, including the
> historical datasets (specifically, these are the SDTM and CDASH
> terminologies in https://evs.nci.nih.gov/ftp1/CDISC/SDTM/Archive/).
>
> Current individual data updates are approximately 1 MB when
> individually saved as .RDS, and the total current set is about 20 MB.
> Since there will be future updates, I think that the preferred way to
> structure this is to generate one data package for each update and then
> have an umbrella package that depends on each of the individual data
> update packages.  That seems like it would minimize space requirements
> on CRAN, since old data will probably never need to be updated (though
> I will still need to access it).
>
> Is that an accurate summary of the best practice for creating these as
> a data package?
>
> And a second question is:
>
> Assuming the best practice is the one I described above, the typical
> need will be to combine the individual historical datasets for local
> use.  An initial test indicates that combining the data takes about 1
> minute, but once combined, the result can be loaded faster.  I'd like
> to store the combined dataset locally with the umbrella package, but I
> believe it is considered poor form to write within a package's library
> location except during installation.
>
> What is the best practice for caching the resulting large,
> locally-generated dataset?
>
> Thanks,
>
> Bill