[Bioc-devel] Data Package Size Issues (.idat and .rda)

Martin Morgan mtmorgan at fhcrc.org
Fri Nov 8 14:41:00 CET 2013


On 11/07/2013 09:26 PM, Nicolas De Jay wrote:
> Thanks for the prompt answer.  The data set I am packaging closely
> resembles that of minfiData except that there are 52 samples; the IDAT
> files together are some 800MB whereas the Rda file is closer to 150MB.
>   It is worth noting that my experiment data package will be submitted
> to Bioconductor along with a software package which makes use of these
> samples in the vignette.  With this in mind, can I omit the IDAT
> files?  If this goes against Bioconductor's underlying design, what
> would you say is the maximum size of an experiment data package?

Hi Nicolas -- Some things to bear in mind.

Files are compressed in package tarballs, so your IDAT files may have a
considerably smaller effective size.
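
One rough way to check this before building is to tar and gzip the raw-data directory and compare sizes. The following is only a minimal Python sketch: the helper name dir_sizes and the inst/extdata default path are illustrative assumptions, and the ratio you get depends entirely on your data. It holds the archive in memory, so it suits a quick estimate rather than routine use on very large directories.

```python
import io
import os
import sys
import tarfile


def dir_sizes(path):
    """Return (raw_bytes, gzipped_tar_bytes) for all regular files under path.

    Hypothetical helper for illustration; not part of any Bioconductor tool.
    """
    raw = 0
    buf = io.BytesIO()  # in-memory archive, so only the sizes are kept around
    # Source package tarballs are gzip-compressed (.tar.gz), so use gzip here.
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for root, _dirs, files in os.walk(path):
            for name in files:
                full = os.path.join(root, name)
                raw += os.path.getsize(full)
                tar.add(full, arcname=os.path.relpath(full, path))
    return raw, len(buf.getvalue())


if __name__ == "__main__":
    # Assumed default location of the raw files inside the package source tree.
    target = sys.argv[1] if len(sys.argv) > 1 else "inst/extdata"
    raw, packed = dir_sizes(target)
    print(f"raw: {raw} bytes, gzipped tar: {packed} bytes")
```

Running the same function on the data/ directory gives a comparable figure for the Rda file, which is typically already compressed when saved.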

Generally, original text files are a much better way to store external data than
Rda files. For instance, Rda files require updating when or if the class
definition changes, whereas the provenance and content of the original files are
unambiguous.

Experiment data packages are meant to provide reusable examples for pedagogic 
purposes. One would hope that minfiData fulfills this requirement. If not, then 
it would be better to continue the current discussion with Kasper and others in 
the community to identify an appropriately comprehensive data set for use across 
many relevant packages.

There is no formal statement about the maximum size of experiment data packages, 
but one would need to make a strong argument for why a GB of experiment data is
necessary (including why existing experiment data packages are fundamentally 
inadequate), especially if it is to support a single package.

Martin

>
> ---
> Nicolas De Jay
>
> On Thu, Nov 7, 2013 at 9:38 PM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>> To give some background: it is true that the RGsetEx object (in
>> data/RGsetEx.rda) is in 1-1 correspondence with the raw data files in
>> inst/extdata, so one could consider it redundant.  However, having the IDAT
>> files is convenient for testing parsing, and also for other tools that want
>> 450k example data without depending on minfi.  Those are the two main
>> reasons for including the raw data as well.  There is also the fact that,
>> while the data size is "big", it is only 6 samples.
>>
>> Best,
>> Kasper
>>
>>
>> On Thu, Nov 7, 2013 at 3:58 PM, Nicolas De Jay
>> <nicolas.dejay at mail.mcgill.ca> wrote:
>>>
>>> Hi,
>>>
>>> I am preparing a data package and using the minfiData package as a
>>> reference.  The .idat files in extdata and the .rda file in data are
>>> both present in the compressed source tarball and in the installed
>>> copy directory (in my case, under ~/R/x86-64.../3.0/minfiData).  Isn't
>>> this redundant?  Is there a way to have the prospective user only
>>> download the .rda files?
>>>
>>> Sorry if my question is misguided and thanks in advance for your help.
>>>
>>> ---
>>> Nicolas De Jay
>>> M.Sc. Student
>>> Department of Human Genetics
>>> Montreal Children's Hospital Research Institute, McGill University Health
>>> Centre
>>> 4060 Ste Catherine West, PT-239
>>> Montreal, QC H3Z2Z3, Canada
>>> T: (514) 412-4440 | E: nicolas.dejay at mail.mcgill.ca
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


