[Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

Martin Morgan martin.morgan at roswellpark.org
Sat Apr 16 22:55:09 CEST 2016



On 04/16/2016 01:09 PM, Marcin Kosiński wrote:
> Hello,
>
> I would like to ask you all for advice on the following issue.
>
> Last year I started working with data from The Cancer Genome Atlas.
> During that work our team (https://github.com/orgs/RTCGA/people)
> prepared some tools for downloading and integrating datasets from the TCGA
> study and provided them in the R package called RTCGA
> <https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which is
> available on Bioconductor.
>
> Later on we were working on tools for visualizing and analyzing the most
> popular datasets from TCGA so we have prepared data packages with those
> datasets and submitted them to Bioconductor in 8 separate packages. You can
> read more about them here http://rtcga.github.io/RTCGA/
>
> *I have a question about updating those data packages.* TCGA releases
> dataset snapshots over time. The RTCGA family of R packages provides
> datasets from the release date 2015-11-01, but currently one can
> check that there is a newer release, 2016-01-28:
>
>> tail(RTCGA::checkTCGA('Dates'))
> [1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
> "2016-01-28"
>
> I am wondering whether we should upload newer datasets to those data
> packages. We have found that there are great differences in results of data
> analysis depending on which release date the datasets were taken from. More
> about this issue can be found here:
> http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata
>
> The current state of RTCGA family of R packages is listed below
>
> RTCGA.clinical
> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>
>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0
>
> RTCGA.rnaseq
> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>
>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>
> RTCGA.mutations
> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>
>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>
> ---------------------------------------------------
>
> RTCGA.methylation
> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>
>    - BiocRelease: NOT YET AVAILABLE
>    - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.1
>
>
> RTCGA.CNV
> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>
>    - BiocRelease: NOT YET AVAILABLE
>    - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.5
>
>
> RTCGA.RPPA
> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>
>    - BiocRelease: NOT YET AVAILABLE
>    - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.6
>
>
> RTCGA.mRNA
> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>
>    - BiocRelease: NOT YET AVAILABLE
>    - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.3
>
>
> RTCGA.miRNASeq
> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>
>    - BiocRelease: NOT YET AVAILABLE
>    - BiocDevel: snapshot from 2015-11-0 || package ver 0.99.4
>
>
> I think that having datasets from the newest snapshot date is vital for
> data analysis, but I wouldn't like to create situations in which 2 separate
> analysts use RTCGA.clinical and get different results because they used
> different data versions. That's why I have started versioning data packages
> with a number that corresponds to the release date.

This isn't very helpful. There is only ever one version of 
'RTCGA.clinical' available per Bioc version, so whether its version is 
20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.

You probably want to include the TCGA release in the package _name_,
e.g. 'RTCGA.clinical.20151101', and to have multiple versions
available at any one time.
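
A minimal sketch of that naming scheme in base R. The helper name
snapshot_pkg_name is hypothetical and only illustrates how a release
date could be folded into the package name; it is not part of RTCGA.

```r
# Hypothetical helper: build a snapshot-specific package name from a
# base name and a TCGA release date (not an RTCGA function).
snapshot_pkg_name <- function(base, release) {
  # Format the date as YYYYMMDD so the name stays a valid R package name
  sprintf("%s.%s", base, format(as.Date(release), "%Y%m%d"))
}

pkg_2015 <- snapshot_pkg_name("RTCGA.clinical", "2015-11-01")
pkg_2016 <- snapshot_pkg_name("RTCGA.clinical", "2016-01-28")

# Two snapshots now have distinct names, so both could be installed
# side by side and an analysis can state exactly which one it used.
print(c(pkg_2015, pkg_2016))
```

With distinct names, a script that calls library() on one of them pins
the snapshot explicitly rather than relying on the package version.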

I don't think the experiment data archive is the best solution for 
distributing large collections of curated data. It places a burden on 
our mirrors to sync the repository and on the svn repository to store
it. The packages are built twice weekly even though the data is very 
static and in your case based on unchanging base R data structures. The 
data are not very 'granular', even though you've done a good job of 
making the individual data sets accessible, so a user interested in 
ovarian cancers, say, would need to download all data anyway.

Instead I think that these should be ExperimentHub resources. How to add 
resources is described in the vignette to the companion package 
ExperimentHubData

    http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html

The data would be stored in Amazon S3 so globally accessible; it would 
not be under version control. The ExperimentHub / AnnotationHub cache 
would manage local versions, rather than R's package system.
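
From the user's side, access might look roughly like the sketch below.
It assumes the ExperimentHub package is installed and that RTCGA
snapshots have been added as hub resources tagged with "RTCGA" and the
release date; those tags and the wrapper name find_rtcga_resources are
assumptions for illustration, not existing resources.

```r
# Hypothetical wrapper: look up hub records for one TCGA snapshot.
find_rtcga_resources <- function(release = "20151101") {
  if (!requireNamespace("ExperimentHub", quietly = TRUE)) {
    stop("Please install the ExperimentHub package first.")
  }
  # Contacts the hub and sets up (or reuses) the local cache
  eh <- ExperimentHub::ExperimentHub()
  # Keep only records whose metadata match every search term
  AnnotationHub::query(eh, c("RTCGA", release))
}

# Usage (needs network access, so left commented out):
# hits <- find_rtcga_resources("20151101")
# hits            # metadata for the matching records
# x <- hits[[1]]  # downloads on first use, then served from the cache
```

Only the records a user actually retrieves are downloaded, which
addresses the granularity concern above.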

ExperimentHub will be back in active development, including addition of 
new resources, immediately after our next release, May 4, so the timing 
is fairly good.

I think it is also worthwhile to discuss how you have chosen to
represent each of the data types, for instance the RNAseq data as a 
samples x genes data.frame whereas the Bioconductor convention would 
store it primarily as a genes x sample matrix embedded in a 
SummarizedExperiment (or at least make it available to the user in that 
form; there are definitely advantages to keeping the serialized instance 
as simple as possible).
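
As a rough illustration of the difference, the sketch below converts a
toy samples x genes data.frame into a genes x samples matrix and, if
the SummarizedExperiment package is installed, wraps it. The column
names (sample_id, GENE_A, GENE_B) and the values are made up; they are
not the actual RTCGA.rnaseq layout.

```r
# Toy samples x genes data.frame (made-up values, not TCGA data).
rnaseq_df <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  GENE_A = c(10, 0, 5),
  GENE_B = c(3, 8, 1),
  stringsAsFactors = FALSE
)

# Drop the identifier column, then transpose to genes x samples.
expr <- t(as.matrix(rnaseq_df[, -1]))
colnames(expr) <- rnaseq_df$sample_id

# Wrap the matrix only when the Bioconductor package is available.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = expr),
    colData = S4Vectors::DataFrame(row.names = colnames(expr))
  )
  print(se)
}
```

The serialized object can stay a plain data.frame; a conversion like
this could be applied when the data are handed to the user.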

Martin Morgan
Bioconductor

>
> What do you think about this issue? You can post advice here or on our
> issue list: https://github.com/RTCGA/RTCGA/issues
>
> Thanks for comments,
> Marcin

