[Bioc-devel] making txdb, and propagating metadata from AnnotationHub to GenomicFeatures
Sonali Arora
sarora at fredhutch.org
Thu Jun 18 10:07:22 CEST 2015
Hi Michael, Herve,
On 6/17/2015 9:43 PM, Hervé Pagès wrote:
> Hi Michael,
>
> On 06/17/2015 12:35 AM, Michael Love wrote:
>> Background:
>>
>> With previous approaches that I would recommend to users for building
>> txdb along the way of making count tables, it was desirable that the
>> GTF release information would *automatically* be passed into the
>> metadata of the rowRanges of the SummarizedExperiment.
>>
>> for example, in parathyroidSE, using makeTxDbFromBiomart:
>>
>> ...
>> BioMart database version : chr "ENSEMBL GENES 72 (SANGER UK)"
>> BioMart dataset : chr "hsapiens_gene_ensembl"
>> BioMart dataset description : chr "Homo sapiens genes (GRCh37.p11)"
>>
>> or, alternatively, using makeTxDbFromGFF, it would be present in the
>> name of the GTF file:
>>
>> ...
>> Data source: chr "Homo_sapiens.GRCh37.75.gtf"
>>
>>
>> Question:
>>
>> I'm now interested in switching over to using GTF files available
>> through AnnotationHub, and wondering how we can maintain this
>> automatic propagation of metadata.
>>
>> here's some example code:
>>
>> library(GenomicFeatures)
>> library(AnnotationHub)
>> ah <- AnnotationHub()
>> z <- query(ah, c("Ensembl","gtf","Caenorhabditis elegans","release-80"))
>> stopifnot(length(z) == 1)
>> z$title
>>
>> [1] "Caenorhabditis_elegans.WBcel235.80.gtf"
>>
>> gtf <- ah[[names(z)]]
>> metadata(gtf)
>>
>> $AnnotationHubName
>> [1] "AH47045"
>
> A first (and hopefully easy) improvement would be that the GRanges and
> other objects on AnnotationHub had more information in their metadata().
The GRanges from AnnotationHub now include more information in their
metadata()
slot. This is added in devel (2.1.27).
library(AnnotationHub)
ah = AnnotationHub()
packageVersion('AnnotationHub')
gtf <- query(ah, c('gtf', 'homo sapiens', '77', 'ensembl'))
gr <- gtf[[1]]
gr
metadata(gr)
> metadata(gr)
$AnnotationHubName
[1] "AH28812"
$`File Name`
[1] "Homo_sapiens.GRCh38.77.gtf.gz"
$`Data Source`
[1]
"ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz"
$Provider
[1] "Ensembl"
$Organism
[1] "Homo sapiens"
$`Taxonomy ID`
[1] 9606
Only some of the fields have been added here, but note that all others
can be accessed with -
mcols(ah["AH47045"])
or
ah[metadata(gr)$AnnotationHubName]
Thanks,
Sonali.
>
> In the mean time, here is a workaround (in 2 steps):
>
> 1) Add some useful metadata to 'gtf' (grabbed from the AnnotationHub
> object):
>
> addMetadataFromAnnotationHub <- function(x, ah)
> {
> stopifnot(length(ah) == 1L)
> metadata0 <- list(
> `Data source`=ah$sourceurl,
> `Provider`=ah$dataprovider,
> `Organism`=ah$species,
> `Taxonomy ID`=ah$taxonomyid
> )
> metadata(x) <- c(metadata0, metadata(x))
> x
> }
>
> gtf <- addMetadataFromAnnotationHub(gtf, z)
>
> metadata(gtf)
> # $`Data source`
> # [1]
> "ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz"
> #
> # $Provider
> # [1] "Ensembl"
> #
> # $Organism
> # [1] "Caenorhabditis elegans"
> #
> # $`Taxonomy ID`
> # [1] 6239
> #
> # $AnnotationHubName
> # [1] "AH47045"
>
> 2) Pass the metadata to makeTxDbFromGRanges():
>
> txdb <- makeTxDbFromGRanges(gtf,
> metadata=data.frame(
> name=names(metadata(gtf)),
> value=as.character(metadata(gtf))))
> txdb
> # TxDb object:
> # Db type: TxDb
> # Supporting package: GenomicFeatures
> # Data source:
> ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz
> # Provider: Ensembl
> # Organism: Caenorhabditis elegans
> # Taxonomy ID: 6239
> # AnnotationHubName: AH47045
> # Genome: WBcel235
> # transcript_nrow: 57834
> # exon_nrow: 173506
> # cds_nrow: 131562
> # Db created by: GenomicFeatures package from Bioconductor
> # Creation time: 2015-06-17 12:24:51 -0700 (Wed, 17 Jun 2015)
> # GenomicFeatures version at creation time: 1.21.13
> # RSQLite version at creation time: 1.0.0
> # DBSCHEMAVERSION: 1.1
>
> organism(txdb)
> # [1] "Caenorhabditis elegans"
> taxonomyId(txdb)
> # [1] 6239
>
> Step 2) should be made easier because the metadata is already in 'gtf'
> so there is no reason why the user would need to pass it again thru the
> 'metadata' argument. I'll made that change to makeTxDbFromGRanges().
>
> H.
>
>>
>> txdb <- makeTxDbFromGRanges(gtf)
>> metadata(txdb)
>>
>> name
>> 1 Db type
>> 2 Supporting package
>> 3 Genome
>> 4 transcript_nrow
>> 5 exon_nrow
>> 6 cds_nrow
>> 7 Db created by
>> 8 Creation time
>> 9 GenomicFeatures version at creation time
>> 10 RSQLite version at creation time
>> 11 DBSCHEMAVERSION
>> value
>> 1 TxDb
>> 2 GenomicFeatures
>> 3 WBcel235
>> 4 57834
>> 5 173506
>> 6 131562
>> 7 GenomicFeatures package from Bioconductor
>> 8 2015-06-17 09:31:10 +0200 (Wed, 17 Jun 2015)
>> 9 1.21.13
>> 10 1.0.0
>> 11 1.1
>>
>>
>> sessionInfo()
>>
>> R Under development (unstable) (2015-04-29 r68278)
>> Platform: x86_64-apple-darwin12.5.0 (64-bit)
>> Running under: OS X 10.10.3 (Yosemite)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats4 parallel stats graphics grDevices datasets utils
>> [8] methods base
>>
>> other attached packages:
>> [1] AnnotationHub_2.1.26 GenomicFeatures_1.21.13
>> AnnotationDbi_1.31.16
>> [4] Biobase_2.29.1 GenomicRanges_1.21.15 GenomeInfoDb_1.5.7
>> [7] IRanges_2.3.11 S4Vectors_0.7.5 BiocGenerics_0.15.2
>> [10] testthat_0.10.0 knitr_1.10 BiocInstaller_1.19.6
>>
>> loaded via a namespace (and not attached):
>> [1] Rcpp_0.11.6 compiler_3.3.0
>> [3] futile.logger_1.4.1 XVector_0.9.1
>> [5] bitops_1.0-6 futile.options_1.0.0
>> [7] tools_3.3.0 zlibbioc_1.15.0
>> [9] biomaRt_2.25.1 digest_0.6.8
>> [11] RSQLite_1.0.0 memoise_0.2.1
>> [13] shiny_0.12.1 DBI_0.3.1
>> [15] rtracklayer_1.29.10 httr_0.6.1
>> [17] stringr_1.0.0 Biostrings_2.37.2
>> [19] R6_2.0.1 XML_3.98-1.2
>> [21] BiocParallel_1.3.26 lambda.r_1.1.7
>> [23] magrittr_1.5 Rsamtools_1.21.8
>> [25] htmltools_0.2.6 GenomicAlignments_1.5.9
>> [27] SummarizedExperiment_0.1.5 xtable_1.7-4
>> [29] mime_0.3 interactiveDisplayBase_1.7.0
>> [31] httpuv_1.3.2 stringi_0.4-1
>> [33] RCurl_1.95-4.6 crayon_1.3.0
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
--
Thanks and Regards,
Sonali
More information about the Bioc-devel
mailing list