[Bioc-devel] making txdb, and propagating metadata from AnnotationHub to GenomicFeatures
Michael Love
michaelisaiahlove at gmail.com
Thu Jun 18 11:37:34 CEST 2015
that's great Hervé and Sonali!
thanks for the quick response.
best, m
On Thu, Jun 18, 2015 at 10:07 AM, Sonali Arora <sarora at fredhutch.org> wrote:
> Hi Michael, Herve,
>
>
> On 6/17/2015 9:43 PM, Hervé Pagès wrote:
>>
>> Hi Michael,
>>
>> On 06/17/2015 12:35 AM, Michael Love wrote:
>>>
>>> Background:
>>>
>>> With previous approaches that I would recommend to users for building
>>> txdb along the way of making count tables, it was desirable that the
>>> GTF release information would *automatically* be passed into the
>>> metadata of the rowRanges of the SummarizedExperiment.
>>>
>>> for example, in parathyroidSE, using makeTxDbFromBiomart:
>>>
>>> ...
>>> BioMart database version : chr "ENSEMBL GENES 72 (SANGER UK)"
>>> BioMart dataset : chr "hsapiens_gene_ensembl"
>>> BioMart dataset description : chr "Homo sapiens genes (GRCh37.p11)"
>>>
>>> or, alternatively, using makeTxDbFromGFF, it would be present in the
>>> name of the GTF file:
>>>
>>> ...
>>> Data source: chr "Homo_sapiens.GRCh37.75.gtf"
>>>
>>>
>>> Question:
>>>
>>> I'm now interested in switching over to using GTF files available
>>> through AnnotationHub, and wondering how we can maintain this
>>> automatic propagation of metadata.
>>>
>>> here's some example code:
>>>
>>> library(GenomicFeatures)
>>> library(AnnotationHub)
>>> ah <- AnnotationHub()
>>> z <- query(ah, c("Ensembl","gtf","Caenorhabditis elegans","release-80"))
>>> stopifnot(length(z) == 1)
>>> z$title
>>>
>>> [1] "Caenorhabditis_elegans.WBcel235.80.gtf"
>>>
>>> gtf <- ah[[names(z)]]
>>> metadata(gtf)
>>>
>>> $AnnotationHubName
>>> [1] "AH47045"
>>
>>
>> A first (and hopefully easy) improvement would be that the GRanges and
>> other objects on AnnotationHub had more information in their metadata().
>
>
> The GRanges from AnnotationHub now include more information in their
> metadata()
> slot. This is added in devel (2.1.27).
>
> library(AnnotationHub)
> ah = AnnotationHub()
> packageVersion('AnnotationHub')
> gtf <- query(ah, c('gtf', 'homo sapiens', '77', 'ensembl'))
> gr <- gtf[[1]]
> gr
> metadata(gr)
>
>> metadata(gr)
> $AnnotationHubName
> [1] "AH28812"
>
> $`File Name`
> [1] "Homo_sapiens.GRCh38.77.gtf.gz"
>
> $`Data Source`
> [1]
> "ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz"
>
> $Provider
> [1] "Ensembl"
>
> $Organism
> [1] "Homo sapiens"
>
> $`Taxonomy ID`
> [1] 9606
>
>
> Only some of the fields have been added here, but note that all others can
> be accessed with -
>
> mcols(ah["AH47045"])
>
> or
>
> ah[metadata(gr)$AnnotationHubName]
>
>
> Thanks,
> Sonali.
>
>>
>> In the mean time, here is a workaround (in 2 steps):
>>
>> 1) Add some useful metadata to 'gtf' (grabbed from the AnnotationHub
>> object):
>>
>> addMetadataFromAnnotationHub <- function(x, ah)
>> {
>> stopifnot(length(ah) == 1L)
>> metadata0 <- list(
>> `Data source`=ah$sourceurl,
>> `Provider`=ah$dataprovider,
>> `Organism`=ah$species,
>> `Taxonomy ID`=ah$taxonomyid
>> )
>> metadata(x) <- c(metadata0, metadata(x))
>> x
>> }
>>
>> gtf <- addMetadataFromAnnotationHub(gtf, z)
>>
>> metadata(gtf)
>> # $`Data source`
>> # [1]
>> "ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz"
>> #
>> # $Provider
>> # [1] "Ensembl"
>> #
>> # $Organism
>> # [1] "Caenorhabditis elegans"
>> #
>> # $`Taxonomy ID`
>> # [1] 6239
>> #
>> # $AnnotationHubName
>> # [1] "AH47045"
>>
>> 2) Pass the metadata to makeTxDbFromGRanges():
>>
>> txdb <- makeTxDbFromGRanges(gtf,
>> metadata=data.frame(
>> name=names(metadata(gtf)),
>> value=as.character(metadata(gtf))))
>> txdb
>> # TxDb object:
>> # Db type: TxDb
>> # Supporting package: GenomicFeatures
>> # Data source:
>> ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz
>> # Provider: Ensembl
>> # Organism: Caenorhabditis elegans
>> # Taxonomy ID: 6239
>> # AnnotationHubName: AH47045
>> # Genome: WBcel235
>> # transcript_nrow: 57834
>> # exon_nrow: 173506
>> # cds_nrow: 131562
>> # Db created by: GenomicFeatures package from Bioconductor
>> # Creation time: 2015-06-17 12:24:51 -0700 (Wed, 17 Jun 2015)
>> # GenomicFeatures version at creation time: 1.21.13
>> # RSQLite version at creation time: 1.0.0
>> # DBSCHEMAVERSION: 1.1
>>
>> organism(txdb)
>> # [1] "Caenorhabditis elegans"
>> taxonomyId(txdb)
>> # [1] 6239
>>
>> Step 2) should be made easier because the metadata is already in 'gtf'
>> so there is no reason why the user would need to pass it again thru the
>> 'metadata' argument. I'll made that change to makeTxDbFromGRanges().
>>
>> H.
>>
>>>
>>> txdb <- makeTxDbFromGRanges(gtf)
>>> metadata(txdb)
>>>
>>> name
>>> 1 Db type
>>> 2 Supporting package
>>> 3 Genome
>>> 4 transcript_nrow
>>> 5 exon_nrow
>>> 6 cds_nrow
>>> 7 Db created by
>>> 8 Creation time
>>> 9 GenomicFeatures version at creation time
>>> 10 RSQLite version at creation time
>>> 11 DBSCHEMAVERSION
>>> value
>>> 1 TxDb
>>> 2 GenomicFeatures
>>> 3 WBcel235
>>> 4 57834
>>> 5 173506
>>> 6 131562
>>> 7 GenomicFeatures package from Bioconductor
>>> 8 2015-06-17 09:31:10 +0200 (Wed, 17 Jun 2015)
>>> 9 1.21.13
>>> 10 1.0.0
>>> 11 1.1
>>>
>>>
>>> sessionInfo()
>>>
>>> R Under development (unstable) (2015-04-29 r68278)
>>> Platform: x86_64-apple-darwin12.5.0 (64-bit)
>>> Running under: OS X 10.10.3 (Yosemite)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats4 parallel stats graphics grDevices datasets utils
>>> [8] methods base
>>>
>>> other attached packages:
>>> [1] AnnotationHub_2.1.26 GenomicFeatures_1.21.13
>>> AnnotationDbi_1.31.16
>>> [4] Biobase_2.29.1 GenomicRanges_1.21.15 GenomeInfoDb_1.5.7
>>> [7] IRanges_2.3.11 S4Vectors_0.7.5 BiocGenerics_0.15.2
>>> [10] testthat_0.10.0 knitr_1.10 BiocInstaller_1.19.6
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Rcpp_0.11.6 compiler_3.3.0
>>> [3] futile.logger_1.4.1 XVector_0.9.1
>>> [5] bitops_1.0-6 futile.options_1.0.0
>>> [7] tools_3.3.0 zlibbioc_1.15.0
>>> [9] biomaRt_2.25.1 digest_0.6.8
>>> [11] RSQLite_1.0.0 memoise_0.2.1
>>> [13] shiny_0.12.1 DBI_0.3.1
>>> [15] rtracklayer_1.29.10 httr_0.6.1
>>> [17] stringr_1.0.0 Biostrings_2.37.2
>>> [19] R6_2.0.1 XML_3.98-1.2
>>> [21] BiocParallel_1.3.26 lambda.r_1.1.7
>>> [23] magrittr_1.5 Rsamtools_1.21.8
>>> [25] htmltools_0.2.6 GenomicAlignments_1.5.9
>>> [27] SummarizedExperiment_0.1.5 xtable_1.7-4
>>> [29] mime_0.3 interactiveDisplayBase_1.7.0
>>> [31] httpuv_1.3.2 stringi_0.4-1
>>> [33] RCurl_1.95-4.6 crayon_1.3.0
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>
> --
> Thanks and Regards,
> Sonali
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list