[Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

Marcin Kosiński m.p.kosinski at gmail.com
Tue Apr 26 20:35:11 CEST 2016


I have read in the vignette that:

2 Adding resources

Resources are contributed to ExperimentHub in the form of a package. The
package contains the resource metadata, man pages, vignette and any
supporting R functions the author wants to provide. This is a similar
design to the existing Bioconductor experimental data packages except the
data are uploaded to AWS S3 buckets instead of stored in a data/ directory
as part of the package.

New packages should be submitted to the Bioconductor tracker and will have
a full review. Contact packages at bioconductor.org for more information.


So if I'd like to provide newer datasets from the newest TCGA data
snapshot, I should submit new packages via the Bioconductor tracker, but
with a slightly different design than an experiment data package.
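
To make the per-release idea concrete: the suggestion quoted further down
in this thread was to carry the TCGA release date in the package name,
e.g. 'RTCGA.clinical.20151101'. Below is a minimal base-R sketch of that
naming scheme, using the snapshot dates printed later in this thread. The
helper snapshot_package_name() is hypothetical and not part of RTCGA; it
only illustrates the idea and says nothing about how the packages would
actually be built.

```r
# Snapshot dates as printed by tail(RTCGA::checkTCGA('Dates')) in this thread.
snapshot_dates <- c("2015-02-04", "2015-04-02", "2015-06-01",
                    "2015-08-21", "2015-11-01", "2016-01-28")

# Hypothetical helper (not part of RTCGA): append the release date,
# formatted as YYYYMMDD, to the base package name.
snapshot_package_name <- function(base, release_date) {
  stamp <- format(as.Date(release_date), "%Y%m%d")
  paste(base, stamp, sep = ".")
}

# Newest snapshot available, and the package names for the newest and
# the currently packaged (2015-11-01) releases.
newest <- max(as.Date(snapshot_dates))
newest_name <- snapshot_package_name("RTCGA.clinical", newest)
current_name <- snapshot_package_name("RTCGA.clinical", "2015-11-01")
```

With names like these, several snapshots could be available side by side,
and an analysis could state exactly which release it used.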

You said that

*ExperimentHub will be back in active development, including addition of
new resources, immediately after our next release, May 4, so the timing is
fairly good.*

Does this mean I should submit these data packages before May 4th or after?

2016-04-18 20:04 GMT+02:00 Marcin Kosiński <m.p.kosinski at gmail.com>:

>
>
> 2016-04-16 22:55 GMT+02:00 Martin Morgan <martin.morgan at roswellpark.org>:
>
>>
>>
>> On 04/16/2016 01:09 PM, Marcin Kosiński wrote:
>>
>>> Hello,
>>>
>>> I would like to ask you all for advice on the following issue.
>>>
>>> Last year I started working with data from The Cancer Genome Atlas.
>>> During that work our team (https://github.com/orgs/RTCGA/people)
>>> prepared some tools for downloading and integrating datasets from the
>>> TCGA study and provided them in the R package called RTCGA
>>> <https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which
>>> is available on Bioconductor.
>>>
>>> Later on we worked on tools for visualizing and analyzing the most
>>> popular datasets from TCGA, so we prepared data packages with those
>>> datasets and submitted them to Bioconductor as 8 separate packages. You
>>> can read more about them here: http://rtcga.github.io/RTCGA/
>>>
>>> *I have a question about updating those data packages.* TCGA releases
>>> dataset snapshots over time. The RTCGA family of R packages currently
>>> provides datasets from the 2015-11-01 release, but one can check that
>>> there is already a newer release, 2016-01-28:
>>>
>>> > tail(RTCGA::checkTCGA('Dates'))
>>> [1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
>>> "2016-01-28"
>>>
>>> I am wondering whether we should upload newer datasets to those data
>>> packages. We have found that the results of data analysis differ
>>> greatly depending on which release date the datasets were taken from.
>>> More about this issue can be found here:
>>> http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata
>>>
>>> The current state of the RTCGA family of R packages is listed below:
>>>
>>> RTCGA.clinical
>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>
>>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0
>>>
>>
>>
>>
>>> RTCGA.rnaseq
>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>
>>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>>>
>>> RTCGA.mutations
>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>
>>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>>>
>>> ---------------------------------------------------
>>>
>>> RTCGA.methylation
>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>
>>>    - BiocRelease: NOT YET AVAILABLE
>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.1
>>>
>>>
>>> RTCGA.CNV
>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>
>>>    - BiocRelease: NOT YET AVAILABLE
>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.5
>>>
>>>
>>> RTCGA.RPPA
>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>
>>>    - BiocRelease: NOT YET AVAILABLE
>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.6
>>>
>>>
>>> RTCGA.mRNA
>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>
>>>    - BiocRelease: NOT YET AVAILABLE
>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.3
>>>
>>>
>>> RTCGA.miRNASeq
>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>
>>>    - BiocRelease: NOT YET AVAILABLE
>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.4
>>>
>>>
>>> I think that having datasets from the newest snapshot date is vital for
>>> data analysis, but I wouldn't like to create situations in which two
>>> separate analysts use RTCGA.clinical and get different results because
>>> they used different data versions. That's why I have started versioning
>>> data packages with a number that corresponds to the release date.
>>>
>>
>> This isn't very helpful. There is only ever one version of
>> 'RTCGA.clinical' available per Bioc version, so whether its version is
>> 20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.
>>
>> Probably you want to include the TCGA release in the package _name_,
>> 'RTCGA.clinical.20151101'. Probably you want to have multiple versions
>> available at any one time.
>>
>
> Thanks for the comments. I hadn't considered making separate packages for
> separate data releases.
>
>
>>
>> I don't think the experiment data archive is the best solution for
>> distributing large collections of curated data. It places a burden on our
>> mirrors to sync the repository and on the svn repository to store it. The
>> packages are built twice weekly even though the data is very static and in
>> your case based on unchanging base R data structures. The data are not very
>> 'granular', even though you've done a good job of making the individual
>> data sets accessible, so a user interested in ovarian cancers, say, would
>> need to download all data anyway.
>>
>> Instead I think that these should be ExperimentHub resources. How to add
>> resources is described in the vignette to the companion package
>> ExperimentHubData
>>
>>
>> http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html
>>
>> The data would be stored in Amazon S3 so globally accessible; it would
>> not be under version control. The ExperimentHub / AnnotationHub cache would
>> manage local versions, rather than R's package system.
>>
>> ExperimentHub will be back in active development, including addition of
>> new resources, immediately after our next release, May 4, so the timing is
>> fairly good.
>>
>
> Thanks for letting me know. I wasn't aware of such a solution. I'll have
> a closer look at ExperimentHub.
>
>
>>
>> I think it is also worthwhile to discuss how you have chosen to
>> represent each of the data types, for instance the RNAseq data as a samples
>> x genes data.frame whereas the Bioconductor convention would store it
>> primarily as a genes x sample matrix embedded in a SummarizedExperiment (or
>> at least make it available to the user in that form; there are definitely
>> advantages to keeping the serialized instance as simple as possible).
>>
>>
> I am aware of the Bioconductor structures. There is an additional
> function, RTCGA::convertTCGA (in devel), that transposes expression data
> sets (rnaseq, miRNASeq, mRNA, methylation, etc.) and embeds them in an
> ExpressionSet:
>
> https://github.com/RTCGA/RTCGA/blob/master/R/convertTCGA.R#L116-L122
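
The layout change discussed here (samples x genes data.frame versus genes x
samples matrix) can be sketched in base R as follows. This is only an
illustration on a made-up toy table, not the code of RTCGA::convertTCGA;
the column name bcr_patient_barcode and the sample and gene labels are
assumptions chosen for the example. The last step wraps the matrix in a
SummarizedExperiment only if that package happens to be installed.

```r
# Toy samples x genes data.frame; identifiers and values are made up.
rnaseq_df <- data.frame(
  bcr_patient_barcode = c("sample_A", "sample_B", "sample_C"),
  GENE1 = c(10.5, 3.2, 8.1),
  GENE2 = c(0.0, 7.7, 2.4),
  stringsAsFactors = FALSE
)

# Drop the identifier column, keep the identifiers as row names, then
# transpose so that rows are genes and columns are samples.
expr <- as.matrix(rnaseq_df[, -1])
rownames(expr) <- rnaseq_df$bcr_patient_barcode
genes_by_samples <- t(expr)

# Optional: embed the matrix in a SummarizedExperiment when available.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(expression = genes_by_samples)
  )
}
```

Keeping the stored object a plain table and converting on access is one
way to keep the serialized instance simple while still offering the
Bioconductor form to the user.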
>
> Marcin Kosiński,
> RTCGA
>
>
>> Martin Morgan
>> Bioconductor
>>
>>
>>> What do you think about this issue? You can post advice here or on our
>>> issue list: https://github.com/RTCGA/RTCGA/issues
>>>
>>> Thanks for comments,
>>> Marcin
>>>
>>
>>
>
>
