[Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

Marcin Kosiński m.p.kosinski at gmail.com
Mon Apr 18 20:04:06 CEST 2016


2016-04-16 22:55 GMT+02:00 Martin Morgan <martin.morgan at roswellpark.org>:

>
>
> On 04/16/2016 01:09 PM, Marcin Kosiński wrote:
>
>> Hello,
>>
>> I would like to ask you all for advice on the following issue.
>>
>> Last year I started working with data from The Cancer Genome Atlas.
>> During that work our team (https://github.com/orgs/RTCGA/people)
>> prepared some tools for downloading and integrating datasets from the TCGA
>> study and provided them in the R package called RTCGA
>> <https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which is
>> available on Bioconductor.
>>
>> Later on we worked on tools for visualizing and analyzing the most
>> popular datasets from TCGA, so we prepared data packages with those
>> datasets and submitted them to Bioconductor as 8 separate packages. You can
>> read more about them here: http://rtcga.github.io/RTCGA/
>>
>> *I have a question about updating those data packages.* TCGA releases
>> dataset snapshots over time. The RTCGA family of R packages provides
>> datasets from the 2015-11-01 release, but one can check that there is
>> already a newer release, 2016-01-28:
>>
>> tail(RTCGA::checkTCGA('Dates'))
>> [1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01" "2016-01-28"
>>
>> I am wondering whether we should upload newer datasets to those data
>> packages. We have found that the results of a data analysis differ greatly
>> depending on which release date the datasets were taken from. More
>> about this issue can be found here:
>> http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata
>>
>> The current state of the RTCGA family of R packages is listed below:
>>
>> RTCGA.clinical
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>
>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0
>>
>
>
>
>> RTCGA.rnaseq
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>
>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>>
>> RTCGA.mutations
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>
>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>>
>> ---------------------------------------------------
>>
>> RTCGA.methylation
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>
>>    - BiocRelease: NOT YET AVAILABLE
>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.1
>>
>> RTCGA.CNV
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>
>>    - BiocRelease: NOT YET AVAILABLE
>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.5
>>
>> RTCGA.RPPA
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>
>>    - BiocRelease: NOT YET AVAILABLE
>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.6
>>
>> RTCGA.mRNA
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>
>>    - BiocRelease: NOT YET AVAILABLE
>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.3
>>
>> RTCGA.miRNASeq
>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>
>>    - BiocRelease: NOT YET AVAILABLE
>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.4
>>
>>
>> I think that having datasets from the newest snapshot date is vital for
>> data analysis, but I wouldn't like to create situations in which 2 separate
>> analysts use RTCGA.clinical and get different results because they used
>> different data versions. That's why I have started versioning data packages
>> with a number that corresponds to the release date.
>>
>
> This isn't very helpful. There is only ever one version of
> 'RTCGA.clinical' available per Bioc version, so whether its version is
> 20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.
>
> Probably you want to include the TCGA release in the package _name_,
> 'RTCGA.clinical.20151101'. Probably you want to have multiple versions
> available at any one time.
>

Thanks for the comments. I hadn't considered making separate packages for
separate data releases.
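
To make the naming idea concrete, here is a minimal sketch in base R. It
reuses the release dates printed by checkTCGA('Dates') earlier in this thread
and the BiocDevel snapshot date from the package list. The helper
snapshot_package_name() is hypothetical and is not part of RTCGA; it only
illustrates putting the TCGA release date into the package name.

```r
# Release dates as printed by tail(RTCGA::checkTCGA('Dates')) earlier in this thread.
release_dates <- as.Date(c("2015-02-04", "2015-04-02", "2015-06-01",
                           "2015-08-21", "2015-11-01", "2016-01-28"))

# Snapshot currently shipped in BiocDevel, according to the package list.
packaged_snapshot <- as.Date("2015-11-01")

# Hypothetical helper (not part of RTCGA): encode the release date in the name.
snapshot_package_name <- function(base, release_date) {
  paste0(base, ".", format(release_date, "%Y%m%d"))
}

# Is there a TCGA release newer than the packaged snapshot?
newest_release <- max(release_dates)
is_outdated <- packaged_snapshot < newest_release

current_name <- snapshot_package_name("RTCGA.clinical", packaged_snapshot)
newest_name <- snapshot_package_name("RTCGA.clinical", newest_release)
```

With names like these, two analysts can state exactly which snapshot they
used, and several snapshots could be installed side by side.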


>
> I don't think the experiment data archive is the best solution for
> distributing large collections of curated data. It places a burden on our
> mirrors to sync the repository and on the svn repository to store it. The
> packages are built twice weekly even though the data is very static and in
> your case based on unchanging base R data structures. The data are not very
> 'granular', even though you've done a good job of making the individual
> data sets accessible, so a user interested in ovarian cancers, say, would
> need to download all data anyway.
>
> Instead I think that these should be ExperimentHub resources. How to add
> resources is described in the vignette to the companion package
> ExperimentHubData
>
>    http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html
>
> The data would be stored in Amazon S3 so globally accessible; it would not
> be under version control. The ExperimentHub / AnnotationHub cache would
> manage local versions, rather than R's package system.
>
> ExperimentHub will be back in active development, including addition of
> new resources, immediately after our next release, May 4, so the timing is
> fairly good.
>

Thanks for letting me know. I wasn't aware of this solution. I'll take a
closer look at ExperimentHub.


>
> I think it is also worthwhile to discuss how you have chosen to represent
> each of the data types, for instance the RNAseq data as a samples x genes
> data.frame whereas the Bioconductor convention would store it primarily as
> a genes x samples matrix embedded in a SummarizedExperiment (or at least
> make it available to the user in that form; there are definitely advantages
> to keeping the serialized instance as simple as possible).
>
>
I have been told about the Bioconductor structures. There is an additional
function, RTCGA::convertTCGA (in devel), that transposes expression data sets
(rnaseq, miRNASeq, mRNA, methylation, etc.) and embeds them in an ExpressionSet:

https://github.com/RTCGA/RTCGA/blob/master/R/convertTCGA.R#L116-L122
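
The transposition itself can be sketched in base R as follows. This is a toy
example with made-up sample names and values, not the code of convertTCGA;
the column name sample_id is an assumption for illustration. The optional
last step wraps the matrix in a SummarizedExperiment and only runs if that
package is installed.

```r
# Toy samples x genes data.frame; sample names and values are made up.
expr_df <- data.frame(
  sample_id = c("sample_1", "sample_2", "sample_3"),
  TP53 = c(10.5, 8.2, 9.9),
  BRCA1 = c(3.1, 4.7, 2.8),
  stringsAsFactors = FALSE
)

# Drop the identifier column, transpose to genes x samples, label the columns.
expr_matrix <- t(as.matrix(expr_df[, -1]))
colnames(expr_matrix) <- expr_df$sample_id

# Optional: embed the matrix in a SummarizedExperiment when the package is available.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(expression = expr_matrix)
  )
}
```

Keeping the stored object a plain matrix and wrapping it on access is one way
to keep the serialized instance simple while still offering the Bioconductor
form to the user.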

Marcin Kosiński,
RTCGA


> Martin Morgan
> Bioconductor
>
>
>> What do you think about this issue? You can post advice here or on our
>> issue list: https://github.com/RTCGA/RTCGA/issues
>>
>> Thanks for comments,
>> Marcin
>>