[Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

Obenchain, Valerie Valerie.Obenchain at roswellpark.org
Mon May 2 04:52:37 CEST 2016


Hi Marcin,

I can help you add these to ExperimentHub after the release. I have a
few other things to tidy up first, so the timing will be around mid-May.

Note that the new data format should not be data.frames but instead
follow what we discussed here:

https://tracker.bioconductor.org/issue1335

Valerie


On 04/26/2016 11:35 AM, Marcin Kosiński wrote:
> I have read in the vignette that
>
> 2 Adding resources
>
> Resources are contributed to ExperimentHub in the form of a package. The
> package contains the resource metadata, man pages, vignette and any
> supporting R functions the author wants to provide. This is a similar
> design to the existing Bioconductor experimental data packages except the
> data are uploaded to AWS S3 buckets instead of stored in a data/ directory
> as part of the package.
>
> New packages should be submitted to the Bioconductor tracker and will have
> a full review. Contact packages at bioconductor.org for more information.
>
>
> So if I'd like to provide newer datasets from the latest TCGA data snapshot,
> I should submit new packages via the Bioconductor tracker, but with a
> slightly different design than an experiment data package.
>
> You said that
>
> *ExperimentHub will be back in active development, including addition of
> new resources, immediately after our next release, May 4, so the timing is
> fairly good.*
>
> Does it mean I should upload these data packages before May 4th or after?
>
> 2016-04-18 20:04 GMT+02:00 Marcin Kosiński <m.p.kosinski at gmail.com>:
>
>>
>> 2016-04-16 22:55 GMT+02:00 Martin Morgan <martin.morgan at roswellpark.org>:
>>
>>>
>>> On 04/16/2016 01:09 PM, Marcin Kosiński wrote:
>>>
>>>> Hello,
>>>>
>>>> I would like to ask you all for advice on the following issue.
>>>>
>>>> Last year I started working with data from The Cancer Genome Atlas.
>>>> During that work our team (https://github.com/orgs/RTCGA/people)
>>>> prepared some tools for downloading and integrating datasets from the TCGA
>>>> study and provided them in the R package called RTCGA
>>>> <https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which
>>>> is available on Bioconductor.
>>>>
>>>> Later on we worked on tools for visualizing and analyzing the most
>>>> popular datasets from TCGA, so we prepared data packages with those
>>>> datasets and submitted them to Bioconductor as 8 separate packages. You
>>>> can read more about them here: http://rtcga.github.io/RTCGA/
>>>>
>>>> *I have a question about updating those data packages.* TCGA releases
>>>> dataset snapshots over time. The RTCGA family of R packages provides
>>>> datasets from the 2015-11-01 release, but one can check that there is
>>>> now a newer release, 2016-01-28:
>>>>
>>>> tail(RTCGA::checkTCGA('Dates'))
>>>> [1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
>>>> "2016-01-28"
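
The check described above can be sketched in base R. This is only an illustration: the dates are copied from the quoted output rather than fetched with RTCGA::checkTCGA, and all variable names are assumptions.

```r
# Snapshot dates copied from the checkTCGA('Dates') output quoted above.
release_dates <- as.Date(c("2015-02-04", "2015-04-02", "2015-06-01",
                           "2015-08-21", "2015-11-01", "2016-01-28"))

# Snapshot currently shipped in the devel data packages, per the message.
packaged_snapshot <- as.Date("2015-11-01")

# Most recent TCGA release and whether the packaged data lag behind it.
latest_snapshot <- max(release_dates)
is_outdated <- packaged_snapshot < latest_snapshot

# Releases published after the packaged snapshot.
newer_releases <- release_dates[release_dates > packaged_snapshot]
```

Here `is_outdated` flags that the packaged snapshot is older than the latest release, and `newer_releases` lists the snapshots that would need to be added.
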
>>>>
>>>> I am wondering whether we should upload newer datasets to those data
>>>> packages. We have found that the results of data analysis differ greatly
>>>> depending on which release date the datasets were taken from. More
>>>> about this issue can be found here:
>>>> http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata
>>>>
>>>> The current state of the RTCGA family of R packages is listed below:
>>>>
>>>> RTCGA.clinical
>>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>
>>>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0
>>>>
>>>
>>>
>>>> RTCGA.rnaseq
>>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>
>>>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>>>>
>>>> RTCGA.mutations
>>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>
>>>>    - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
>>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0
>>>>
>>>> ---------------------------------------------------
>>>>
>>>> RTCGA.methylation
>>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>
>>>>    - BiocRelease: NOT YET AVAILABLE
>>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.1
>>>>
>>>> RTCGA.CNV
>>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>
>>>>    - BiocRelease: NOT YET AVAILABLE
>>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.5
>>>>
>>>> RTCGA.RPPA
>>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>
>>>>    - BiocRelease: NOT YET AVAILABLE
>>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.6
>>>>
>>>> RTCGA.mRNA
>>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>
>>>>    - BiocRelease: NOT YET AVAILABLE
>>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.3
>>>>
>>>> RTCGA.miRNASeq
>>>> <http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>
>>>>    - BiocRelease: NOT YET AVAILABLE
>>>>    - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.4
>>>>
>>>>
>>>> I think that having datasets from the newest snapshot date is vital for
>>>> data analysis, but I wouldn't like to create situations in which 2
>>>> separate analysts use RTCGA.clinical and get different results because
>>>> they used different data versions. That's why I have started versioning
>>>> data packages with a number that corresponds to the release date.
>>>>
>>> This isn't very helpful. There is only ever one version of
>>> 'RTCGA.clinical' available per Bioc version, so whether its version is
>>> 20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.
>>>
>>> Probably you want to include the TCGA release in the package _name_,
>>> 'RTCGA.clinical.20151101'. Probably you want to have multiple versions
>>> available at any one time.
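
The distinction above between encoding the release in the version and encoding it in the name can be sketched in base R. The helper `snapshot_package_name` is hypothetical and not part of RTCGA or Bioconductor; the version strings are the ones mentioned in this thread.

```r
# Version strings taken from the discussion above.
date_style <- package_version("20151101.1.0")
plain_style <- package_version("1.1.0")

# R only sees that one version is newer than the other; the embedded date
# carries no special meaning for the package system.
date_style > plain_style

# Hypothetical helper: put the TCGA release date in the package name instead,
# so that several snapshots could exist side by side as distinct packages.
snapshot_package_name <- function(base, release_date) {
  paste0(base, ".", format(as.Date(release_date), "%Y%m%d"))
}

snapshot_names <- vapply(c("2015-11-01", "2016-01-28"),
                         snapshot_package_name, character(1),
                         base = "RTCGA.clinical", USE.NAMES = FALSE)
```

With this naming scheme an analysis script states which snapshot it depends on simply through the package it loads.
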
>>>
>> Thanks for the comments. I hadn't considered making separate packages for
>> separate data releases.
>>
>>
>>> I don't think the experiment data archive is the best solution for
>>> distributing large collections of curated data. It places a burden on our
>>> mirrors to sync the repository and on the svn repository to store it. The
>>> packages are built twice weekly even though the data is very static and in
>>> your case based on unchanging base R data structures. The data are not very
>>> 'granular', even though you've done a good job of making the individual
>>> data sets accessible, so a user interested in ovarian cancers, say, would
>>> need to download all data anyway.
>>>
>>> Instead I think that these should be ExperimentHub resources. How to add
>>> resources is described in the vignette to the companion package
>>> ExperimentHubData
>>>
>>>
>>> http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html
>>>
>>> The data would be stored in Amazon S3 so globally accessible; it would
>>> not be under version control. The ExperimentHub / AnnotationHub cache would
>>> manage local versions, rather than R's package system.
>>>
>>> ExperimentHub will be back in active development, including addition of
>>> new resources, immediately after our next release, May 4, so the timing is
>>> fairly good.
>>>
>> Thanks for letting me know. I wasn't aware of such a solution. I'll take
>> a closer look at ExperimentHub.
>>
>>
>>> I think it is also worthwhile to discuss how you have chosen to
>>> represent each of the data types, for instance the RNAseq data as a samples
>>> x genes data.frame whereas the Bioconductor convention would store it
>>> primarily as a genes x sample matrix embedded in a SummarizedExperiment (or
>>> at least make it available to the user in that form; there are definitely
>>> advantages to keeping the serialized instance as simple as possible).
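
The change of orientation described above can be sketched in base R with toy data. The barcodes, gene names and values below are made up, and the exact layout agreed on in the tracker issue is not shown in this thread, so this is only an assumption about the general shape.

```r
# Toy samples x genes data.frame; barcodes, gene names and values are made up.
samples_by_genes <- data.frame(
  bcr_patient_barcode = c("SAMPLE-01", "SAMPLE-02", "SAMPLE-03"),
  GENE_A = c(1.5, 2.0, 0.3),
  GENE_B = c(0.0, 3.2, 4.1),
  stringsAsFactors = FALSE
)

# Drop the identifier column, convert to a numeric matrix and transpose so
# that rows are genes and columns are samples.
expr <- t(as.matrix(samples_by_genes[, -1]))
colnames(expr) <- samples_by_genes$bcr_patient_barcode

# With the Bioconductor SummarizedExperiment package installed, the matrix
# could then be wrapped along these lines (not run here):
# se <- SummarizedExperiment::SummarizedExperiment(assays = list(counts = expr))
```

Keeping the serialized object as a plain matrix and wrapping it on load is one way to combine a simple stored instance with the richer container for the user.
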
>>>
>>>
>> I've been informed about Bioconductor structures. There is an additional
>> function, RTCGA::convertTCGA (in devel), that transposes expression data sets
>> (rnaseq, miRNASeq, mRNA, methylation, etc.) and embeds them in an ExpressionSet
>>
>> https://github.com/RTCGA/RTCGA/blob/master/R/convertTCGA.R#L116-L122
>>
>> Marcin Kosiński,
>> RTCGA
>>
>>
>>> Martin Morgan
>>> Bioconductor
>>>
>>>
>>>> What do you think about this issue? You can post advice here or on our
>>>> issue list: https://github.com/RTCGA/RTCGA/issues
>>>>
>>>> Thanks for comments,
>>>> Marcin
>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>> This email message may contain legally privileged and/or confidential
>>> information.  If you are not the intended recipient(s), or the employee or
>>> agent responsible for the delivery of this message to the intended
>>> recipient(s), you are hereby notified that any disclosure, copying,
>>> distribution, or use of this email message is prohibited.  If you have
>>> received this message in error, please notify the sender immediately by
>>> e-mail and delete this email message from your computer. Thank you.
>>>
>>


