[Bioc-devel] Data Package Size Issues (.idat and .rda)

Nicolas De Jay nicolas.dejay at mail.mcgill.ca
Fri Nov 8 19:54:38 CET 2013


In that case, I will try to see if the public databases have the kind
of data sets I am trying to package and run the idea by the team that
is assigned to the project I am developing.  Thank you Martin, Sean
and Kasper for your valuable insight!

---
Nicolas De Jay

On Fri, Nov 8, 2013 at 9:07 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>
>
> On Fri, Nov 8, 2013 at 8:41 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>
>> On 11/07/2013 09:26 PM, Nicolas De Jay wrote:
>>>
>>> Thanks for the prompt answer.  The data set I am packaging closely
>>> resembles that of minfiData except that there are 52 samples; the IDAT
>>> files together are some 800MB whereas the Rda file is closer to 150MB.
>>> It is worth noting that my experiment data package will be submitted
>>> to Bioconductor along with a software package which makes use of these
>>> samples in the vignette.  With this in mind, can I omit the IDAT
>>> files?  If this goes against Bioconductor's underlying design, what
>>> would you say is the maximum size of an experiment data package?
>>
>>
>> Hi Nicolas -- Some things to bear in mind.
>>
>
> Hi, Nicolas.
>
> I just wanted to note that experiment data packages are meant as a
> convenient way to distribute data so that reproducible workflows and
> documentation can be created easily.  There are other options such as
> accessing the data directly from public repositories using Bioconductor
> tools that serve the same purpose.  While accessing such online resources
> does necessitate a one-time network connection (after which packages like
> GEOquery can use locally cached data), when appropriate datasets exist in
> public repositories, it may be a perfectly viable alternative to experiment
> data packages.  In this particular case, as of today in NCBI GEO, there are
> 1711 Human 450k samples with IDAT files available.  I am not arguing that
> this route should replace experiment data packages, just that stable public
> data resources are an alternative worth considering.
>
> Sean
>
>
>>
>> Files are compressed in package tarballs, so your IDAT files may have a
>> considerably smaller effective size.
>>
>> Generally, original text files are a much better way to store external
>> data than Rda files. For instance, Rda files require updating when / if the
>> class definition changes, whereas the provenance and content of the
>> original files are unambiguous.
>>
>> Experiment data packages are meant to provide reusable examples for
>> pedagogic purposes. One would hope that minfiData fulfills this requirement.
>> If not, then it would be better to continue the current discussion with
>> Kasper and others in the community to identify an appropriately
>> comprehensive data set for use across many relevant packages.
>>
>> There is no formal statement about the maximum size of experiment data
>> packages, but one would need to make a strong argument for why a GB of
>> experiment data is necessary (including why existing experiment data
>> packages are fundamentally inadequate), especially if it is to support a
>> single package.
>>
>> Martin
>>
>>>
>>> ---
>>> Nicolas De Jay
>>>
>>> On Thu, Nov 7, 2013 at 9:38 PM, Kasper Daniel Hansen
>>> <kasperdanielhansen at gmail.com> wrote:
>>>>
>>>> To give some background: it is true that the RGsetEx object (in
>>>> data/RGsetEx.rda) is in 1-1 correspondence with the raw data files in
>>>> inst/extdata, so one could consider it redundant.  However, having the
>>>> IDAT files is convenient for testing parsing, and also for other tools
>>>> that want 450k example data without depending on minfi.  Those are the
>>>> two main reasons for including the raw data as well.  There is also the
>>>> fact that, while the data size is "big", it is only 6 samples.
>>>>
>>>> Best,
>>>> Kasper
>>>>
>>>>
>>>> On Thu, Nov 7, 2013 at 3:58 PM, Nicolas De Jay
>>>> <nicolas.dejay at mail.mcgill.ca> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am preparing a data package and using the minfiData package as a
>>>>> reference.  The .idat files in extdata and the .rda file in data are
>>>>> both present in the compressed source tarball and in the installed
>>>>> package directory (in my case, under ~/R/x86-64.../3.0/minfiData).  Isn't
>>>>> this redundant?  Is there a way to have the prospective user only
>>>>> download the .rda files?
>>>>>
>>>>> Sorry if my question is misguided and thanks in advance for your help.
>>>>>
>>>>> ---
>>>>> Nicolas De Jay
>>>>> M.Sc. Student
>>>>> Department of Human Genetics
>>>>> Montreal Children's Hospital Research Institute, McGill University
>>>>> Health Centre
>>>>> 4060 Ste Catherine West, PT-239
>>>>> Montreal, QC H3Z2Z3, Canada
>>>>> T: (514) 412-4440 | E: nicolas.dejay at mail.mcgill.ca
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>
>
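
On the point quoted above that files are compressed in package tarballs: a rough way to see how much smaller a directory becomes once it is packed is sketched below. This is only an illustration, written in Python rather than R, and it assumes gzip compression as used for .tar.gz source packages. The `tarball_sizes` helper and the stand-in file are hypothetical; real IDAT files are binary and will likely compress far less than the repetitive data used here.

```python
import io
import os
import tarfile
import tempfile


def tarball_sizes(directory):
    """Return (raw_bytes, gzipped_tar_bytes) for all files under `directory`."""
    raw = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            raw += os.path.getsize(os.path.join(root, name))

    buf = io.BytesIO()
    # Build the gzip-compressed tar archive in memory rather than on disk.
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(directory, arcname=os.path.basename(os.path.normpath(directory)))
    return raw, len(buf.getvalue())


if __name__ == "__main__":
    # Hypothetical stand-in data: a single, highly repetitive file.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "example.dat"), "wb") as fh:
            fh.write(b"0123456789" * 100_000)
        raw, packed = tarball_sizes(tmp)
        print(f"raw: {raw} bytes, gzipped tarball: {packed} bytes")
```

Pointing `tarball_sizes` at the directory holding the actual IDAT files would give a rough idea of the effective download size before deciding whether to ship them.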


