[Bioc-devel] making txdb, and propagating metadata from AnnotationHub to GenomicFeatures

Sonali Arora sarora at fredhutch.org
Thu Jun 18 10:07:22 CEST 2015


Hi Michael, Herve,

On 6/17/2015 9:43 PM, Hervé Pagès wrote:
> Hi Michael,
>
> On 06/17/2015 12:35 AM, Michael Love wrote:
>> Background:
>>
>> With previous approaches that I would recommend to users for building
>> txdb along the way of making count tables, it was desirable that the
>> GTF release information would *automatically* be passed into the
>> metadata of the rowRanges of the SummarizedExperiment.
>>
>> for example, in parathyroidSE, using makeTxDbFromBiomart:
>>
>> ...
>> BioMart database version : chr "ENSEMBL GENES 72 (SANGER UK)"
>> BioMart dataset : chr "hsapiens_gene_ensembl"
>> BioMart dataset description : chr "Homo sapiens genes (GRCh37.p11)"
>>
>> or, alternatively, using makeTxDbFromGFF, it would be present in the
>> name of the GTF file:
>>
>> ...
>> Data source: chr "Homo_sapiens.GRCh37.75.gtf"
>>
>>
>> Question:
>>
>> I'm now interested in switching over to using GTF files available
>> through AnnotationHub, and wondering how we can maintain this
>> automatic propagation of metadata.
>>
>> here's some example code:
>>
>> library(GenomicFeatures)
>> library(AnnotationHub)
>> ah <- AnnotationHub()
>> z <- query(ah, c("Ensembl","gtf","Caenorhabditis elegans","release-80"))
>> stopifnot(length(z) == 1)
>> z$title
>>
>>   [1] "Caenorhabditis_elegans.WBcel235.80.gtf"
>>
>> gtf <- ah[[names(z)]]
>> metadata(gtf)
>>
>>   $AnnotationHubName
>>   [1] "AH47045"
>
> A first (and hopefully easy) improvement would be that the GRanges and
> other objects on AnnotationHub had more information in their metadata().

The GRanges from AnnotationHub now include more information in their 
metadata()
slot. This is added in devel (2.1.27).

library(AnnotationHub)
ah = AnnotationHub()
packageVersion('AnnotationHub')
gtf <- query(ah, c('gtf', 'homo sapiens', '77', 'ensembl'))
gr <- gtf[[1]]
gr
metadata(gr)

 > metadata(gr)
$AnnotationHubName
[1] "AH28812"

$`File Name`
[1] "Homo_sapiens.GRCh38.77.gtf.gz"

$`Data Source`
[1] 
"ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz"

$Provider
[1] "Ensembl"

$Organism
[1] "Homo sapiens"

$`Taxonomy ID`
[1] 9606


Only some of the fields have been added here, but note that all others 
can be accessed with -

mcols(ah["AH47045"])

or

ah[metadata(gr)$AnnotationHubName]


Thanks,
Sonali.
>
> In the mean time, here is a workaround (in 2 steps):
>
> 1) Add some useful metadata to 'gtf' (grabbed from the AnnotationHub
>    object):
>
>   addMetadataFromAnnotationHub <- function(x, ah)
>   {
>     stopifnot(length(ah) == 1L)
>     metadata0 <- list(
>         `Data source`=ah$sourceurl,
>         `Provider`=ah$dataprovider,
>         `Organism`=ah$species,
>         `Taxonomy ID`=ah$taxonomyid
>     )
>     metadata(x) <- c(metadata0, metadata(x))
>     x
>   }
>
>   gtf <- addMetadataFromAnnotationHub(gtf, z)
>
>   metadata(gtf)
>   # $`Data source`
>   # [1] 
> "ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz"
>   #
>   # $Provider
>   # [1] "Ensembl"
>   #
>   # $Organism
>   # [1] "Caenorhabditis elegans"
>   #
>   # $`Taxonomy ID`
>   # [1] 6239
>   #
>   # $AnnotationHubName
>   # [1] "AH47045"
>
> 2) Pass the metadata to makeTxDbFromGRanges():
>
>   txdb <- makeTxDbFromGRanges(gtf,
>               metadata=data.frame(
>                   name=names(metadata(gtf)),
>                   value=as.character(metadata(gtf))))
>   txdb
>   # TxDb object:
>   # Db type: TxDb
>   # Supporting package: GenomicFeatures
>   # Data source: 
> ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz
>   # Provider: Ensembl
>   # Organism: Caenorhabditis elegans
>   # Taxonomy ID: 6239
>   # AnnotationHubName: AH47045
>   # Genome: WBcel235
>   # transcript_nrow: 57834
>   # exon_nrow: 173506
>   # cds_nrow: 131562
>   # Db created by: GenomicFeatures package from Bioconductor
>   # Creation time: 2015-06-17 12:24:51 -0700 (Wed, 17 Jun 2015)
>   # GenomicFeatures version at creation time: 1.21.13
>   # RSQLite version at creation time: 1.0.0
>   # DBSCHEMAVERSION: 1.1
>
>   organism(txdb)
>   # [1] "Caenorhabditis elegans"
>   taxonomyId(txdb)
>   # [1] 6239
>
> Step 2) should be made easier because the metadata is already in 'gtf'
> so there is no reason why the user would need to pass it again thru the
> 'metadata' argument. I'll made that change to makeTxDbFromGRanges().
>
> H.
>
>>
>> txdb <- makeTxDbFromGRanges(gtf)
>> metadata(txdb)
>>
>>                                          name
>>   1                                   Db type
>>   2                        Supporting package
>>   3                                    Genome
>>   4                           transcript_nrow
>>   5                                 exon_nrow
>>   6                                  cds_nrow
>>   7                             Db created by
>>   8                             Creation time
>>   9  GenomicFeatures version at creation time
>>   10         RSQLite version at creation time
>>   11                          DBSCHEMAVERSION
>>                                             value
>>   1                                          TxDb
>>   2                               GenomicFeatures
>>   3                                      WBcel235
>>   4                                         57834
>>   5                                        173506
>>   6                                        131562
>>   7     GenomicFeatures package from Bioconductor
>>   8  2015-06-17 09:31:10 +0200 (Wed, 17 Jun 2015)
>>   9                                       1.21.13
>>   10                                        1.0.0
>>   11                                          1.1
>>
>>
>> sessionInfo()
>>
>>   R Under development (unstable) (2015-04-29 r68278)
>>   Platform: x86_64-apple-darwin12.5.0 (64-bit)
>>   Running under: OS X 10.10.3 (Yosemite)
>>
>>   locale:
>>   [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>>   attached base packages:
>>   [1] stats4    parallel  stats     graphics  grDevices datasets  utils
>>   [8] methods   base
>>
>>   other attached packages:
>>    [1] AnnotationHub_2.1.26    GenomicFeatures_1.21.13 
>> AnnotationDbi_1.31.16
>>    [4] Biobase_2.29.1          GenomicRanges_1.21.15 GenomeInfoDb_1.5.7
>>    [7] IRanges_2.3.11          S4Vectors_0.7.5 BiocGenerics_0.15.2
>>   [10] testthat_0.10.0         knitr_1.10 BiocInstaller_1.19.6
>>
>>   loaded via a namespace (and not attached):
>>    [1] Rcpp_0.11.6                  compiler_3.3.0
>>    [3] futile.logger_1.4.1          XVector_0.9.1
>>    [5] bitops_1.0-6                 futile.options_1.0.0
>>    [7] tools_3.3.0                  zlibbioc_1.15.0
>>    [9] biomaRt_2.25.1               digest_0.6.8
>>   [11] RSQLite_1.0.0                memoise_0.2.1
>>   [13] shiny_0.12.1                 DBI_0.3.1
>>   [15] rtracklayer_1.29.10          httr_0.6.1
>>   [17] stringr_1.0.0                Biostrings_2.37.2
>>   [19] R6_2.0.1                     XML_3.98-1.2
>>   [21] BiocParallel_1.3.26          lambda.r_1.1.7
>>   [23] magrittr_1.5                 Rsamtools_1.21.8
>>   [25] htmltools_0.2.6              GenomicAlignments_1.5.9
>>   [27] SummarizedExperiment_0.1.5   xtable_1.7-4
>>   [29] mime_0.3                     interactiveDisplayBase_1.7.0
>>   [31] httpuv_1.3.2                 stringi_0.4-1
>>   [33] RCurl_1.95-4.6               crayon_1.3.0
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>

-- 
Thanks and Regards,
Sonali



More information about the Bioc-devel mailing list