[Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

Marcin Kosiński m.p.kosinski at gmail.com
Sat Apr 16 19:09:15 CEST 2016


Hello,

I would like to ask you all for advice on the following issue.

Last year I started working with data from The Cancer Genome Atlas (TCGA).
During that work our team (https://github.com/orgs/RTCGA/people) prepared
tools for downloading and integrating datasets from the TCGA study and
provided them in an R package called RTCGA
<https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which is
available on Bioconductor.

Later we worked on tools for visualizing and analyzing the most popular
TCGA datasets, so we prepared data packages with those datasets and
submitted them to Bioconductor as 8 separate packages. You can read more
about them here: http://rtcga.github.io/RTCGA/

*I have a question about updating those data packages.* TCGA releases
dataset snapshots over time. The RTCGA family of R packages provides
datasets from the 2015-11-01 release, but one can check that a newer
release, 2016-01-28, is already available:

> tail(RTCGA::checkTCGA('Dates'))
[1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
"2016-01-28"

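As a rough sketch of that check in base R (the dates are copied from the
output above so that it runs without network access; the variable names
are only illustrative and not part of RTCGA):

```r
# Release dates as printed by RTCGA::checkTCGA('Dates') above.
available <- as.Date(c("2015-02-04", "2015-04-02", "2015-06-01",
                       "2015-08-21", "2015-11-01", "2016-01-28"))

# Snapshot date mentioned in this message as the one in the data packages.
packaged <- as.Date("2015-11-01")

# Releases that are newer than the packaged snapshot.
newer <- available[available > packaged]
update_needed <- length(newer) > 0

format(newer)
```
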
I am wondering whether we should upload the newer datasets to those data
packages. We have found that the results of data analysis differ greatly
depending on which release date the datasets were taken from. More about
this issue can be found here:
http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata

The current state of the RTCGA family of R packages is listed below:

RTCGA.clinical
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>
  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0

RTCGA.rnaseq
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>
  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0

RTCGA.mutations
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>
  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0

---------------------------------------------------

RTCGA.methylation
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.1


RTCGA.CNV
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.5


RTCGA.RPPA
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.6


RTCGA.mRNA
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.3


RTCGA.miRNASeq
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>
  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.4


I think that having datasets from the newest snapshot date is vital for
data analysis, but I would not like to create situations in which two
analysts use RTCGA.clinical and get different results because they used
different data versions. That is why I have started versioning the data
packages with a number that corresponds to the release date.
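
To illustrate the idea, here is a minimal sketch in base R of how a
snapshot date could be turned into such a version number. The helper
snapshot_version() is hypothetical and not part of RTCGA; it only
mirrors the pattern of the version strings listed above (for example
20151101.1.0), and the meaning of the remaining two components is an
assumption.

```r
# Hypothetical helper: build a version string whose first component is
# the snapshot date written as YYYYMMDD.
snapshot_version <- function(snapshot_date, minor = 0L, patch = 0L) {
  major <- format(as.Date(snapshot_date), "%Y%m%d")
  paste(major, minor, patch, sep = ".")
}

v_old <- snapshot_version("2015-11-01", minor = 1L)
v_new <- snapshot_version("2016-01-28")

# package_version() compares versions component by component, so a
# later snapshot date always sorts as a newer package version.
is_newer <- package_version(v_new) > package_version(v_old)
```

With this scheme an analyst can report the exact snapshot used by
quoting the package version alone.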

What do you think about this issue? You can post advice here or on our
issue list: https://github.com/RTCGA/RTCGA/issues

Thanks for comments,
Marcin
